LLM Evaluation and Telemetry | Enterprise AI Harness Part 3

This is the final instalment of our 3-part deep dive into AI Harness Engineering. Read Part 1 here and Part 2 here.

You have built a secure, model-agnostic gateway. But how do you know if your prompts are actually working? If a developer tweaks the system prompt, how do you ensure it hasn’t degraded the response quality?

In this final part, we will add Continuous Evaluation and Telemetry to our AI Harness.

1. Telemetry: Tracking the Cost of Intelligence

Every time your application calls an LLM, it costs money. To implement chargebacks and identify inefficiencies, your harness must log the exact token usage and latency of every request.

Let’s update our EnterpriseAIHarness from Part 2 to include a telemetry logger.

import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("AI_Telemetry")

class TelemetryMiddleware:
    def log_request(self, model_id: str, prompt: str, response: str, latency: float):
        # In a real enterprise system, you would calculate exact tokens using tiktoken
        # Here we use a rough approximation (4 characters per token)
        prompt_tokens = len(prompt) // 4
        response_tokens = len(response) // 4
        total_tokens = prompt_tokens + response_tokens
        
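        # Placeholder flat rate of £0.01 per 1,000 tokens; substitute your provider's
        # actual per-model prices, which often differ for input and output tokens.
        estimated_cost = (total_tokens / 1000) * 0.01

        logger.info(
            f"Model: {model_id} | "
            f"Latency: {latency:.2f}s | "
            f"Total Tokens (approx): {total_tokens} | "
            f"Cost (estimated): £{estimated_cost:.4f}"
        )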
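
For chargebacks, a single log line per request is usually not enough: you also need to attribute cost to a team and price each model separately. The following is a minimal sketch of that idea, not part of the harness from Part 2. The `UsageRecord` type, the `team` field, the model names and every price in `PRICE_PER_1K` are hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in GBP per 1,000 tokens; replace with your provider's rates.
PRICE_PER_1K = {
    "model-small": {"input": 0.001, "output": 0.002},
    "model-large": {"input": 0.010, "output": 0.030},
}


@dataclass
class UsageRecord:
    team: str
    model_id: str
    prompt_tokens: int
    response_tokens: int
    latency_s: float


def request_cost(record: UsageRecord) -> float:
    """Cost of one request, pricing input and output tokens separately."""
    price = PRICE_PER_1K[record.model_id]
    return (
        record.prompt_tokens / 1000 * price["input"]
        + record.response_tokens / 1000 * price["output"]
    )


def chargeback_by_team(records: list[UsageRecord]) -> dict[str, float]:
    """Sum the estimated cost of all requests per team."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.team] += request_cost(record)
    return dict(totals)


records = [
    UsageRecord("search", "model-small", 2000, 1000, 0.4),
    UsageRecord("search", "model-large", 1000, 1000, 1.9),
    UsageRecord("support", "model-small", 4000, 500, 0.6),
]
totals = chargeback_by_team(records)
for team, cost in totals.items():
    print(f"{team}: £{cost:.4f}")
```

In practice the records would come from your log store rather than an in-memory list, and the token counts from the provider's reported usage rather than a character-based estimate.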

2. Continuous Evaluation (LLM-as-a-Judge)

Evaluating text is hard because there is rarely one “correct” answer. A widely used approach is the LLM-as-a-Judge pattern: we use a highly capable model (like Gemini 3.1 Pro or Claude Opus 4.6) to score the outputs of our production models against a grading rubric.

Here is an out-of-the-box evaluator class:

import json

import google.generativeai as genai

class EvaluatorHarness:
    def __init__(self, judge_model_id="gemini-3.1-pro"):
        self.judge = genai.GenerativeModel(judge_model_id)

    def evaluate_response(self, original_prompt: str, generated_response: str) -> dict:
        """
        Uses an LLM judge to evaluate if the response is helpful, 
        accurate, and free of hallucinations.
        """
        evaluation_prompt = f"""
        You are an expert AI evaluator. Review the following response to the prompt.

        Prompt: "{original_prompt}"
        Response: "{generated_response}"

        Score the response from 1 to 5 on the following criteria:
        1. Accuracy (Is it factually correct?)
        2. Helpfulness (Does it answer the user's question?)
        3. Tone (Is it professional?)

        Output your response strictly in JSON format:
        {{"accuracy": X, "helpfulness": Y, "tone": Z, "feedback": "Your reason here"}}
        """

        # We request JSON output from the judge
        result = self.judge.generate_content(
            evaluation_prompt,
            generation_config={"response_mime_type": "application/json"}
        )
        # result.text is a JSON string; parse it so the method returns a dict as annotated
        return json.loads(result.text)
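
Even when JSON output is requested, it is worth validating the judge's reply before trusting the scores, because a judge can still return a missing field or an out-of-range value. Here is a minimal sketch of such a check using only the standard library. The helper name `parse_judge_output` and the `CRITERIA` tuple are illustrative assumptions, not part of any SDK.

```python
import json

# The three rubric criteria used in the evaluation prompt above.
CRITERIA = ("accuracy", "helpfulness", "tone")


def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON reply and check each score is an integer from 1 to 5."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    for name in CRITERIA:
        score = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"invalid score for {name!r}: {score!r}")
    return data


sample = '{"accuracy": 5, "helpfulness": 4, "tone": 5, "feedback": "Clear and correct."}'
scores = parse_judge_output(sample)
print(scores["accuracy"], scores["helpfulness"], scores["tone"])
```

Raising on a bad reply, rather than silently defaulting a score, keeps malformed judgements out of your quality metrics.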

Putting It All Together

Let’s integrate telemetry and evaluation into the execution loop:

class GovernedAIHarness(EnterpriseAIHarness):
    def __init__(self):
        super().__init__()
        self.telemetry = TelemetryMiddleware()
        self.evaluator = EvaluatorHarness()

    def execute_with_governance(self, prompt: str, model_id: str = "gemini-3.5-flash"):
        start_time = time.time()
        
        # 1. Generate Response (using routing and guardrails from Part 2)
        response = self.execute(prompt, model_id)
        
        latency = time.time() - start_time
        
        # 2. Log Telemetry
        self.telemetry.log_request(model_id, prompt, response, latency)
        
        # 3. Evaluate Quality (Asynchronous in production)
        eval_score = self.evaluator.evaluate_response(prompt, response)
        logger.info(f"Evaluation Score: {eval_score}")
        
        return response
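
The listing above calls the judge inline, so every user request also waits for the evaluation. One way to take it off the request path is a background thread pool, sketched below. `slow_judge` is a hypothetical stand-in for `EvaluatorHarness.evaluate_response`, and the pool size is an arbitrary assumption; a production system might prefer a message queue or a batch job instead.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor

# A small shared pool so evaluations do not block the request path.
_eval_pool = ThreadPoolExecutor(max_workers=2)


def slow_judge(prompt: str, response: str) -> dict:
    """Stand-in for EvaluatorHarness.evaluate_response (hypothetical stub)."""
    time.sleep(0.05)  # simulate the judge model's latency
    return {"accuracy": 5, "helpfulness": 5, "tone": 5, "feedback": "stub"}


def submit_evaluation(prompt: str, response: str) -> Future:
    """Queue the evaluation and return immediately with a Future."""
    return _eval_pool.submit(slow_judge, prompt, response)


future = submit_evaluation("What is 2 + 2?", "4")
# The caller can return the response to the user here without waiting.
result = future.result(timeout=5)  # block only where the score is actually needed
print(result)
```

The trade-off is that a score arrives after the response has already been served, so this suits monitoring and regression tracking rather than blocking a bad answer.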

By integrating this harness into your CI/CD pipeline, you can run a suite of 100 test prompts every time a developer changes the system instructions. If the mean accuracy score across the suite drops below 4.0, the pipeline fails.
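
One reading of this gate is a threshold on the mean accuracy across all test prompts. The sketch below shows that check under this assumption; the function name `accuracy_gate` and the sample scores are illustrative, and a real suite would collect its evaluations from the judge rather than from hard-coded lists.

```python
from statistics import mean


def accuracy_gate(evaluations: list[dict], threshold: float = 4.0) -> int:
    """Return a process exit code: 0 if mean accuracy meets the threshold, else 1."""
    if not evaluations:
        return 1  # an empty suite should not pass silently
    avg = mean(e["accuracy"] for e in evaluations)
    print(f"Mean accuracy: {avg:.2f} (threshold {threshold:.2f})")
    return 0 if avg >= threshold else 1


# Illustrative judge scores for two runs of a three-prompt suite.
passing = [{"accuracy": 5}, {"accuracy": 4}, {"accuracy": 4}]
failing = [{"accuracy": 5}, {"accuracy": 3}, {"accuracy": 3}]

exit_code_ok = accuracy_gate(passing)
exit_code_bad = accuracy_gate(failing)
```

In a CI job you would pass the return value to `sys.exit`, since most pipelines treat a non-zero exit code as a failed step.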

Conclusion

Building an Enterprise AI Harness is a critical step in moving from AI prototyping to AI production. By centralising routing, implementing strict PII guardrails, and enforcing continuous evaluation, you protect your business while empowering your engineers to build faster.

Why Alps Agility?

At Alps Agility, we combine deep AI expertise with advanced engineering to help you implement autonomous agents that cut costs and improve operational efficiency.

Contact us today to start transforming your enterprise with Agentic AI.
