LLM Finetuning · 2 min read

Evaluating Fine-Grained Performance in Custom LLMs

How do you know your model is 'good'? Moving beyond loss curves to semantic evaluation frameworks and 'LLM-as-a-Judge'.

Training a model is the easy part; knowing whether it is actually any good is the hard part. A falling loss curve shows the model is learning, but is it learning the right thing? Is it helpful? Is it safe? Evaluating AI is tough because there is rarely just one “correct” answer.

How We Evaluate

We use a three-step process to check our work:

1. Basic Checks

First, we use standard text-overlap metrics to see how much the model’s answers overlap with the “correct” answers. This is a quick and dirty check to make sure it’s on the right track.
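To make that concrete, here is a minimal sketch of the kind of overlap check we mean: a token-level F1 score between the model’s answer and a reference answer (off-the-shelf metrics such as ROUGE follow the same idea). The function and example strings are illustrative, not our production tooling.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 overlap between a model answer and the reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Count tokens that appear in both answers (with multiplicity)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Quick sanity check against a known-good answer
print(token_f1("The SLA covers 99.9% uptime", "Our SLA guarantees 99.9% uptime"))
```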

2. AI Judging AI

We use a much larger, smarter model (like GPT-4) to act as a judge. We show it the question and our model’s answer, and ask it to score the answer out of 5 for things like:

  • Accuracy: Is the information correct?
  • Tone: Is it professional?
  • Safety: Did it refuse to do anything bad?

This works remarkably well and is much faster than human review.
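As an illustration, a judge call might look something like the sketch below. It assumes the OpenAI Python client with an API key in the environment; the model name, rubric wording, and example inputs are placeholders rather than our exact production prompt.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}

Score the answer from 1 to 5 on each criterion and reply with JSON only:
{{"accuracy": <1-5>, "tone": <1-5>, "safety": <1-5>, "comment": "<one sentence>"}}"""

def judge(question: str, answer: str) -> dict:
    """Ask a stronger model to grade our model's answer on a fixed rubric."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any sufficiently strong judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

scores = judge("What does our refund policy cover?", "Refunds apply within 30 days of purchase.")
print(scores)  # e.g. {"accuracy": 4, "tone": 5, "safety": 5, "comment": "..."}
```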

3. Human Feedback (The Gold Standard)

Ultimately, humans are the ones using the tool. We set up “Blind Tests” where experts look at two answers (Answer A and Answer B) without knowing which is which, and pick the winner. This is the final verification step.
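A blind comparison can be as simple as shuffling the two answers before they reach the reviewer and only revealing the mapping after the vote is recorded. The sketch below uses hypothetical model labels and helper names to show the idea.

```python
import random

def make_blind_pair(question: str, answer_old: str, answer_new: str) -> dict:
    """Present two answers in random order so reviewers can't tell which model wrote which."""
    pair = [("old_model", answer_old), ("new_model", answer_new)]
    random.shuffle(pair)
    return {
        "question": question,
        "answer_a": pair[0][1],
        "answer_b": pair[1][1],
        # Kept server-side only; never shown to the reviewer
        "key": {"A": pair[0][0], "B": pair[1][0]},
    }

def record_vote(blind_pair: dict, winner: str) -> str:
    """Map the reviewer's 'A' or 'B' vote back to the model that produced the answer."""
    return blind_pair["key"][winner]
```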

The “Golden Set”

Every project starts by creating a “Golden Set”: 100 difficult questions with perfect answers. We use this as a benchmark to measure progress every time we update the model.
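In practice, that benchmark run can be a short script. The sketch below reuses the token_f1 overlap check from earlier, assumes the Golden Set is stored as a JSONL file, and uses a hypothetical generate_answer call standing in for the fine-tuned model; the pass threshold is arbitrary.

```python
import json

def run_golden_set(path: str = "golden_set.jsonl", threshold: float = 0.6) -> float:
    """Score every Golden Set question and report the pass rate for this model version."""
    passed, total = 0, 0
    with open(path) as f:
        for line in f:
            item = json.loads(line)  # each line: {"question": "...", "reference": "..."}
            answer = generate_answer(item["question"])  # hypothetical call to the fine-tuned model
            if token_f1(answer, item["reference"]) >= threshold:
                passed += 1
            total += 1
    pass_rate = passed / total
    print(f"Golden Set pass rate: {passed}/{total} ({pass_rate:.0%})")
    return pass_rate
```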

At Alps Agility, we don’t just deliver a black box. We deliver a thoroughly tested, evaluated, and safe AI solution.

Trust, but verify. Ensure your AI investments are delivering real value. Speak to our Evaluation experts.
