Building an Enterprise AI Harness: Part 3 - Evaluation and Telemetry
Learn how to build an automated LLM-as-a-Judge evaluation loop and capture telemetry data to ensure your Generative AI applications remain accurate and cost-effective in production.
This is the final instalment of our 3-part deep dive into AI Harness Engineering. Read Part 1 here and Part 2 here.
You have built a secure, model-agnostic gateway. But how do you know if your prompts are actually working? If a developer tweaks the system prompt, how do you ensure it hasn’t degraded the response quality?
In this final part, we will add Continuous Evaluation and Telemetry to our AI Harness.
1. Telemetry: Tracking the Cost of Intelligence
Every time your application calls an LLM, it costs money. To implement chargebacks and identify inefficiencies, your harness must log the exact token usage and latency of every request.
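To make the chargeback idea concrete, here is a minimal sketch of how logged usage could be rolled up per team. The `UsageRecord` fields, the `team` label, the model names, and the price table are all hypothetical placeholders for illustration; real rates depend on your provider and contract.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical price table (GBP per 1,000 tokens); not real provider pricing.
PRICE_PER_1K_TOKENS = {"model-a": 0.01, "model-b": 0.002}


@dataclass
class UsageRecord:
    team: str          # cost centre to charge back to (assumed field)
    model_id: str
    total_tokens: int
    latency_s: float


def estimate_cost(record: UsageRecord) -> float:
    """Estimate the cost of one request from its token count."""
    rate = PRICE_PER_1K_TOKENS[record.model_id]
    return record.total_tokens / 1000 * rate


def chargeback_by_team(records: list[UsageRecord]) -> dict[str, float]:
    """Sum the estimated cost of all requests, grouped by team."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.team] += estimate_cost(record)
    return dict(totals)


# Made-up sample data for illustration only.
records = [
    UsageRecord("search", "model-a", 2000, 1.2),
    UsageRecord("search", "model-b", 5000, 0.4),
    UsageRecord("support", "model-a", 500, 0.9),
]
print(chargeback_by_team(records))
```

Keeping the price table separate from the logging code means a rate change only touches one dictionary.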
Let’s update our EnterpriseAIHarness from Part 2 to include a telemetry logger.
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("AI_Telemetry")


class TelemetryMiddleware:
    def log_request(self, model_id: str, prompt: str, response: str, latency: float):
        # In a real enterprise system, you would calculate exact tokens using tiktoken
        # Here we use a rough approximation (4 characters per token)
        prompt_tokens = len(prompt) // 4
        response_tokens = len(response) // 4
        total_tokens = prompt_tokens + response_tokens
        logger.info(
            f"Model: {model_id} | "
            f"Latency: {latency:.2f}s | "
            f"Total Tokens (approx): {total_tokens} | "
            f"Cost (estimated): £{(total_tokens / 1000) * 0.01:.4f}"
        )

2. Continuous Evaluation (LLM-as-a-Judge)
Evaluating text is hard because there is rarely one “correct” answer. A widely used approach is the LLM-as-a-Judge pattern: we use a highly capable model (like Gemini 3.1 Pro or Claude Opus 4.6) to score the outputs of our production models against a grading rubric.
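A judge model's reply is only useful if it can be trusted structurally, so it is worth validating before any score is acted on. The sketch below assumes the judge replies with a JSON object holding integer scores from 1 to 5 for `accuracy`, `helpfulness`, and `tone`; the function name `parse_judge_output` is a hypothetical helper, not part of any SDK.

```python
import json

# Criteria the judge is assumed to score, each on a 1 to 5 scale.
CRITERIA = ("accuracy", "helpfulness", "tone")


def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON reply and check each score is an int from 1 to 5."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    for name in CRITERIA:
        score = data.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"invalid {name} score: {score!r}")
    return data


# Made-up judge reply for illustration only.
sample = '{"accuracy": 4, "helpfulness": 5, "tone": 5, "feedback": "Clear and correct."}'
print(parse_judge_output(sample)["accuracy"])
```

Rejecting out-of-range or missing scores early keeps a malformed judge reply from silently skewing an average later on.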
Here is an out-of-the-box evaluator class:
import json

import google.generativeai as genai


class EvaluatorHarness:
    def __init__(self, judge_model_id="gemini-3.1-pro"):
        self.judge = genai.GenerativeModel(judge_model_id)

    def evaluate_response(self, original_prompt: str, generated_response: str) -> dict:
        """
        Uses an LLM judge to evaluate if the response is helpful,
        accurate, and free of hallucinations.
        """
        evaluation_prompt = f"""
        You are an expert AI evaluator. Review the following response to the prompt.

        Prompt: "{original_prompt}"
        Response: "{generated_response}"

        Score the response from 1 to 5 on the following criteria:
        1. Accuracy (Is it factually correct?)
        2. Helpfulness (Does it answer the user's question?)
        3. Tone (Is it professional?)

        Output your response strictly in JSON format:
        {{"accuracy": X, "helpfulness": Y, "tone": Z, "feedback": "Your reason here"}}
        """
        # We enforce a JSON output format
        result = self.judge.generate_content(
            evaluation_prompt,
            generation_config={"response_mime_type": "application/json"}
        )
        # Parse the JSON text so the return value matches the declared dict type
        return json.loads(result.text)

Putting It All Together
Let’s integrate telemetry and evaluation into the execution loop:
class GovernedAIHarness(EnterpriseAIHarness):
    def __init__(self):
        super().__init__()
        self.telemetry = TelemetryMiddleware()
        self.evaluator = EvaluatorHarness()

    def execute_with_governance(self, prompt: str, model_id: str = "gemini-3.5-flash"):
        start_time = time.time()
        # 1. Generate Response (using routing and guardrails from Part 2)
        response = self.execute(prompt, model_id)
        latency = time.time() - start_time
        # 2. Log Telemetry
        self.telemetry.log_request(model_id, prompt, response, latency)
        # 3. Evaluate Quality (Asynchronous in production)
        eval_score = self.evaluator.evaluate_response(prompt, response)
        logger.info(f"Evaluation Score: {eval_score}")
        return response

By integrating this harness into your CI/CD pipeline, you can run a suite of 100 test prompts every time a developer changes the system instructions. If the average accuracy score across the suite drops below 4.0, the pipeline fails.
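The pipeline gate described above could be sketched as follows. This assumes each evaluation has already been parsed into a dict with an integer `accuracy` field; the `gate` function and the sample results are hypothetical, and treating an empty result set as a failure is a design choice of this sketch rather than something the harness prescribes.

```python
from statistics import mean

ACCURACY_THRESHOLD = 4.0  # threshold named in the text


def gate(evaluations: list[dict], threshold: float = ACCURACY_THRESHOLD) -> int:
    """Return a process exit code: 0 if mean accuracy meets the threshold, else 1."""
    if not evaluations:
        # No evidence of quality is treated as a failure (assumption of this sketch).
        print("No evaluations collected; failing the build.")
        return 1
    average = mean(e["accuracy"] for e in evaluations)
    print(f"Mean accuracy over {len(evaluations)} prompts: {average:.2f}")
    return 0 if average >= threshold else 1


# Made-up results; in a pipeline these would come from running the
# evaluator over every prompt in the test suite.
results = [{"accuracy": 5}, {"accuracy": 4}, {"accuracy": 3}]
exit_code = gate(results)
```

In a CI job, passing `exit_code` to `sys.exit` is enough for most runners to mark the step as failed when the value is non-zero.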
Conclusion
Building an Enterprise AI Harness is the most critical step in transitioning from AI prototyping to AI production. By centralising routing, implementing strict PII guardrails, and enforcing continuous evaluation, you protect your business while empowering your engineers to build faster.
