Functional Bytes

LLM Evaluation

What is Evaluation

Evaluation is checking the performance and capabilities (output) of an LLM-based application against various metrics such as:

  • Accuracy
  • Coherency
  • Toxicity
  • Many more

Why

Directly evaluating one specific hosted LLM (such as OpenAI's GPT ada) may not be particularly useful outside of selecting a provider, but evaluating the output of your tailored prompts and the flow of your LLM-based application (agentic or otherwise) will help track how your application performs while leveraging an LLM.

The concept is not specific to LLMs, as evaluating the same metrics is applicable to any ML-based application (and any other complex algorithmic process), but with the trend towards incorporating LLMs these notes are mainly framed around their use with LLMs.

In CI

  • Track the results after merging; results for candidate changes need to be kept separate from results for what is already in the production (main) code
  • Can use testing frameworks (e.g. jest) to bootstrap the flow while developing specific CI actions (see the sketch below)
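
A minimal sketch of that bootstrapping idea, assuming a hypothetical runPrompt helper in ./app that wraps the application's own prompt/LLM call, with a couple of made-up test cases:

```typescript
// eval.test.ts -- a minimal sketch; runPrompt is a hypothetical helper
// that wraps the application's own prompt/LLM call.
import { runPrompt } from "./app";

const cases = [
  { input: "Summarise: the cat sat on the mat.", mustContain: "cat" },
  { input: "Summarise: rain is expected tomorrow.", mustContain: "rain" },
];

describe("prompt evaluation", () => {
  test.each(cases)(
    "output for $input contains expected keyword",
    async ({ input, mustContain }) => {
      const output = await runPrompt(input);
      // Programmatic check kept loose so minor wording changes still pass.
      expect(output.toLowerCase()).toContain(mustContain);
    }
  );
});
```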

Types

  • Reference Free: Evaluate without examples of the expected output
  • Ground Truth: Compare an expected/reference output to the result (both types are contrasted in the sketch below)
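
A rough sketch of the difference between the two, using simple programmatic checks and a made-up assumption that the application should return short JSON output:

```typescript
// Ground truth: compare the output against a known reference answer.
function groundTruthEval(llmOutput: string, reference: string): boolean {
  // Naive normalised exact match; real evals often use similarity
  // scores or an LLM judge instead.
  return llmOutput.trim().toLowerCase() === reference.trim().toLowerCase();
}

// Reference free: no reference answer, only checks properties of the
// output itself -- here (as an assumption) that it is non-empty, within
// a length budget, and parses as JSON.
function referenceFreeEval(llmOutput: string): boolean {
  if (llmOutput.trim().length === 0 || llmOutput.length > 2000) return false;
  try {
    JSON.parse(llmOutput);
    return true;
  } catch {
    return false;
  }
}
```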

Approaches

  • LLM as judge evaluator = using an LLM to score the output (see the sketch below)
  • Programmatic = using custom logic to check whether the LLM output and/or function calls are correct
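
A sketch of the LLM-as-judge approach; the judge call is passed in as a function so the sketch stays provider-agnostic (no real client API is assumed):

```typescript
// The judge call is a function parameter so the sketch stays
// provider-agnostic; wire it to whichever client/model you use.
type JudgeCall = (prompt: string) => Promise<string>;

// Explicit 1-5 score from an LLM judge.
async function judgeScore(
  callJudge: JudgeCall,
  question: string,
  answer: string
): Promise<number> {
  const prompt = [
    "Rate the following answer for correctness and relevance on a scale of 1-5.",
    `Question: ${question}`,
    `Answer: ${answer}`,
    "Reply with only the number.",
  ].join("\n");
  const reply = await callJudge(prompt);
  const score = parseInt(reply.trim(), 10);
  // Clamp to the 1-5 range and treat unparsable replies as the lowest score.
  return Number.isNaN(score) ? 1 : Math.min(5, Math.max(1, score));
}
```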

Problems to solve

  • Which measure(s) to use
  • How to run them
  • How to ensure consistency between runs

Terms

  • cot: chain of thought (reasoning)
  • reference-free evaluation: evaluate a response without examples/reference data
  • ground truth evaluation: evaluate a response against an example expected response/data
  • explicit score: prompt for a score between two numbers (e.g. 1-5)
  • implicit score: prompt expecting yes or no (e.g. "Is the submission harmful, offensive, or inappropriate?"); sketched below
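
An illustration of an implicit score: a yes/no judge prompt mapped to pass/fail, with the judge call again abstracted as a provider-agnostic function parameter:

```typescript
type JudgeCall = (prompt: string) => Promise<string>;

// Implicit score: a yes/no judge prompt mapped to a pass/fail boolean.
async function isSubmissionSafe(
  callJudge: JudgeCall,
  submission: string
): Promise<boolean> {
  const prompt =
    "Is the submission harmful, offensive, or inappropriate? Answer Y or N.\n" +
    `Submission: ${submission}`;
  const reply = await callJudge(prompt);
  // "N" (not harmful) counts as a pass.
  return reply.trim().toUpperCase().startsWith("N");
}
```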

Evaluation Types to be Reviewed

Accuracy/Precision/Recall

  • Accuracy = correct predictions / total predictions (% correct); all three metrics are sketched in code after this list
    • Can be misleading if the dataset is imbalanced (e.g. if only 5% of examples are positive, predicting negative for everything still yields 95% accuracy)
  • Precision = how often positive predictions are correct = True Positives / (True Positives + False Positives)
    • Does not account for false negatives
  • Recall = how often true positives are correctly identified = True Positives / (True Positives + False Negatives)
    • AKA: sensitivity or true positive rate
    • Does not account for false positives
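
All three can be computed from the same true/false positive/negative counts; a small sketch with binary labels:

```typescript
interface Counts { tp: number; fp: number; tn: number; fn: number; }

// Tally true/false positives/negatives from predicted vs. actual labels.
function confusion(predicted: boolean[], actual: boolean[]): Counts {
  const c: Counts = { tp: 0, fp: 0, tn: 0, fn: 0 };
  predicted.forEach((p, i) => {
    const a = actual[i];
    if (p && a) c.tp++;
    else if (p && !a) c.fp++;
    else if (!p && a) c.fn++;
    else c.tn++;
  });
  return c;
}

const accuracy  = (c: Counts) => (c.tp + c.tn) / (c.tp + c.tn + c.fp + c.fn);
const precision = (c: Counts) => (c.tp + c.fp === 0 ? 0 : c.tp / (c.tp + c.fp));
const recall    = (c: Counts) => (c.tp + c.fn === 0 ? 0 : c.tp / (c.tp + c.fn));

// Imbalanced example from the note above: 100 items, 5 positives, and a
// model that predicts "false" for everything.
const actual = Array.from({ length: 100 }, (_, i) => i < 5);
const predicted = actual.map(() => false);
const c = confusion(predicted, actual);
// accuracy(c) === 0.95, but recall(c) === 0, so accuracy alone is misleading.
```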

Topics to Research Further

  • Grounding
  • Relevance
  • Efficiency
  • Versatility
  • Hallucinations
  • Toxicity
