LLM Evaluation
What is Evaluation
Evaluating the performance and capabilities (output) of an LLM-based application by checking various metrics such as:
- Accuracy
- Coherency
- Toxicity
- Many more
Why
Directly evaluating one specific hosted LLM (such as OpenAI's ada) may not be particularly useful beyond selecting a provider. Evaluating the output of your tailored prompts and the flow of your LLM-based application (agentic or otherwise), however, helps track how your application performs while leveraging an LLM.
The concept is not specific to LLMs: the same metrics can apply to any ML-based application (or any other complex algorithmic process). Given the trend towards incorporating LLMs, though, these notes are framed mainly around their use with LLMs.
In CI
- Track the results after merging; keep results for proposed changes separate from results for the production (main) code
- Can use testing frameworks (e.g. jest) to bootstrap the flow while developing specific CI actions (see the sketch below)
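A minimal sketch of bootstrapping an eval as a jest test, assuming a hypothetical `runPrompt` entry point into the application's prompt/flow:

```typescript
import { describe, expect, it } from "@jest/globals";

// Hypothetical wrapper around the application's prompt/agent flow.
async function runPrompt(input: string): Promise<string> {
  // ...call the real LLM-backed flow here; stubbed for the sketch...
  return `summary of: ${input}`;
}

describe("summarization eval", () => {
  it("mentions the key entity from the source text", async () => {
    const output = await runPrompt("Acme Corp reported record Q3 revenue.");
    // Programmatic, reference-free check: no golden answer, just a property assertion.
    expect(output.toLowerCase()).toContain("acme");
  });
});
```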
Types
- Reference-free: Evaluate without examples of the expected output
- Ground truth: Compare an expected/reference output to the result (both types sketched below)
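A quick sketch of the distinction; the function names and checks are illustrative assumptions, not any library's API:

```typescript
// Ground truth: compare the result against a known reference answer.
function groundTruthEval(output: string, expected: string): boolean {
  return output.trim().toLowerCase() === expected.trim().toLowerCase();
}

// Reference-free: judge the output on its own properties (e.g. non-empty, no refusal).
function referenceFreeEval(output: string): boolean {
  return output.length > 0 && !output.toLowerCase().includes("i cannot");
}
```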
Approaches
- LLM-as-judge evaluator = using an LLM to score the output (sketch below)
- Programmatic = using custom logic to check whether the LLM output and/or function calls are correct
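A minimal LLM-as-judge sketch, assuming a hypothetical `callModel` wrapper around whichever provider/SDK is in use; the prompt asks for an explicit 1-5 score and the reply is parsed defensively:

```typescript
// Hypothetical wrapper around the model provider/SDK.
declare function callModel(prompt: string): Promise<string>;

// LLM-as-judge: ask a model for an explicit 1-5 relevance score.
async function judgeRelevance(question: string, answer: string): Promise<number> {
  const prompt = [
    "You are grading an answer for relevance to the question.",
    `Question: ${question}`,
    `Answer: ${answer}`,
    "Reply with only an integer score from 1 (irrelevant) to 5 (fully relevant).",
  ].join("\n");
  const reply = await callModel(prompt);
  const score = parseInt(reply.trim(), 10);
  // Guard against the judge replying with something unparsable.
  return Number.isNaN(score) ? 1 : Math.min(5, Math.max(1, score));
}
```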
Problems to solve
- Which measure(s) to use
- How to run them
- How to ensure consistency between runs
Terms
| Term | Definition |
| --- | --- |
| CoT | Chain of thought (reasoning) |
| reference-free evaluation | Evaluate a response without examples/reference data |
| ground truth evaluation | Evaluate a response against an example expected response/data |
| explicit score | Prompt for a score between two numbers (e.g. 1-5) |
| implicit score | Prompt expecting yes or no (e.g. "Is the submission harmful, offensive, or inappropriate?") |
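A small sketch of the two scoring styles from the table; the prompt wording and helper names are illustrative assumptions:

```typescript
// Explicit score: ask the judge for a number in a fixed range.
const explicitScorePrompt = (submission: string): string =>
  `On a scale of 1-5, how coherent is the following submission?\n${submission}\nReply with only the number.`;

// Implicit score: ask a yes/no question and treat the answer as the score.
const implicitScorePrompt = (submission: string): string =>
  `Is the submission harmful, offensive, or inappropriate?\n${submission}\nReply with only "yes" or "no".`;

// Turning the implicit (yes/no) reply into a boolean result.
const parseImplicitReply = (reply: string): boolean =>
  reply.trim().toLowerCase().startsWith("yes");
```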
Evaluation Types to be Reviewed
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- Set of metrics comparing n-gram overlaps (ROUGE-1 unigram overlap is sketched after this list)
- https://en.wikipedia.org/wiki/ROUGE_(metric)
- https://aclanthology.org/W04-1013.pdf
- BLEU
- bilingual evaluation understudy
- https://en.wikipedia.org/wiki/BLEU
- F-score
- harmonic mean of the precision and recall
- https://en.wikipedia.org/wiki/F-score
- Perplexity
- Measure of how well a model predicts the next token; lower perplexity indicates higher confidence
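A rough ROUGE-1 sketch (unigram overlap between a candidate and a reference); real implementations also handle stemming, stopwords, and ROUGE-2/ROUGE-L, so treat this as illustrative only:

```typescript
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

function rouge1(candidate: string, reference: string) {
  const candTokens = tokenize(candidate);
  const refTokens = tokenize(reference);

  // Count reference unigrams, then clip candidate matches against those counts.
  const refCounts = new Map<string, number>();
  for (const t of refTokens) refCounts.set(t, (refCounts.get(t) ?? 0) + 1);

  let overlap = 0;
  for (const t of candTokens) {
    const remaining = refCounts.get(t) ?? 0;
    if (remaining > 0) {
      overlap += 1;
      refCounts.set(t, remaining - 1);
    }
  }

  const precision = candTokens.length ? overlap / candTokens.length : 0;
  const recall = refTokens.length ? overlap / refTokens.length : 0;
  // F1 is the harmonic mean of precision and recall, as in the F-score entry above.
  const f1 = precision + recall ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1 };
}
```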
Accuracy/Precision/Recall
- Accuracy = correct predictions / total predictions (% correct)
- Can be misleading if the dataset is imbalanced (e.g. if only 5% of examples are positives, predicting negative for everything still yields 95% accuracy; see the sketch after this list)
- Precision = how often positive predictions are correct = True Positives / (True Positives + False Positives)
- Does not account for false negatives
- Recall = how often true positives are correctly identified = True Positives / ( True Positives + False Negatives)
- AKA: sensitivity or true positive rate
- Does not account for false positives
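A minimal sketch computing these metrics from raw confusion-matrix counts; the `ConfusionCounts` shape and `classificationMetrics` name are assumptions for illustration. The commented call reproduces the imbalanced-dataset example above.

```typescript
interface ConfusionCounts {
  truePositives: number;
  falsePositives: number;
  trueNegatives: number;
  falseNegatives: number;
}

function classificationMetrics(c: ConfusionCounts) {
  const total = c.truePositives + c.falsePositives + c.trueNegatives + c.falseNegatives;
  // Accuracy: share of all predictions that were correct.
  const accuracy = total ? (c.truePositives + c.trueNegatives) / total : 0;
  // Precision: share of positive predictions that were actually positive.
  const precision = c.truePositives + c.falsePositives
    ? c.truePositives / (c.truePositives + c.falsePositives)
    : 0;
  // Recall: share of actual positives that were identified.
  const recall = c.truePositives + c.falseNegatives
    ? c.truePositives / (c.truePositives + c.falseNegatives)
    : 0;
  // F-score: harmonic mean of precision and recall.
  const f1 = precision + recall ? (2 * precision * recall) / (precision + recall) : 0;
  return { accuracy, precision, recall, f1 };
}

// Imbalanced example: 5% positives, predict negative for everything.
// classificationMetrics({ truePositives: 0, falsePositives: 0, trueNegatives: 95, falseNegatives: 5 })
// => accuracy 0.95, but precision, recall, and f1 are all 0.
```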
Topics to Research Further
- Grounding
- Relevance
- Efficiency
- Versatility
- Hallucinations
- Toxicity