
What We Learned in 60 Days of Building with GenAI

Two months into our team's GenAI journey, here are the hard-won lessons from 60 days of building, evaluating, and shipping AI-powered features in production.

GenAI
Engineering
Lessons Learned

How do you know if your GenAI system is working? It's the question we find ourselves asking on nearly every consultancy call. With traditional software, "finished" was a clear target. With GenAI — not so much.

Two months ago, our new GenAI in Testing team launched with a simple goal: build brand-new, industry-leading internal GenAI tooling for Quality Engineers (QEs). Fast forward to today and we're fielding back-to-back requests from not just Quality Engineering but across our company. Teams want to adopt the tooling we're building. They want to upskill in this new technology. But it's very clear the approach required for not just building but testing these new systems is still a challenge for those not close to the technology day in and day out.

Which brings us back to that question: how do you know if your GenAI system is working the way you want it to? Traditional testing was a pre-release activity, and it was quite clear what "finished" looked like. Not anymore.

After weeks of customer calls, prototypes, failures, and unexpected edge cases, here are the biggest lessons I've learned so far.


Test Value Proposition First

This point is worth making explicit.

In our short time as a team we've seen one issue recur constantly: every team wants to jump into building and using these shiny new tools before they're clear on the value proposition.

Let me paint a picture for you:

You've spun up a cool new team working on very interesting challenges. Word is getting around. You start offering advice and guidance to spin up little mini-projects for a few teams here and there. One thing leads to another and you now need a solid front door/triage process because you simply can't keep up.

It's tempting to add GenAI because it feels like you need to. But unless you can answer this question:

How will this application help users achieve something faster or better than before?

...then you need to take a step back.

Whether it's ticket triaging, document summarisation, or user guidance — GenAI systems should make something measurably better. And the clearer that benefit is, the easier it becomes for you to:

  • Define tests
  • Get relevant stakeholder buy-in
  • Evaluate the gap between your original vision and the practical reality of what can be built
  • Iterate or pivot meaningfully

If You Can't Define Success, the Model Won't Either

This is the first thing I tell teams: you need to be crystal clear about what "good" looks like.

GenAI systems aren't deterministic. You won't get the same answer twice, and you can't write traditional tests for either (1) natural language applications or (2) output generation systems (e.g. advice generators, data mutation tools). That means success has to be defined upfront, as clearly and measurably as possible.

What to Ask Yourself

  • What are we trying to generate — summaries? decisions? advice? emails?
  • What attributes matter most — accuracy? tone? safety?
  • How will we evaluate success — human feedback? business KPIs? LLM-as-a-Judge?
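To make the scaffolding concrete, here's a minimal sketch of what a success definition might look like once it's written down. The use case, dimensions, thresholds, and 1-to-5 scale are illustrative assumptions, not a prescription:

```python
# A minimal sketch of a written-down success definition for one GenAI
# feature. Every name and threshold here is an illustrative assumption.

SUCCESS_CRITERIA = {
    "what_we_generate": "ticket triage summaries",
    "dimensions": {
        # attribute -> minimum acceptable average score on a 1-5 scale
        "accuracy": 4.0,
        "tone": 3.5,
        "safety": 5.0,
    },
    "evaluation_methods": ["SME review", "LLM-as-a-Judge"],
}

def meets_bar(scores: dict) -> bool:
    """Return True only if every dimension clears its threshold."""
    thresholds = SUCCESS_CRITERIA["dimensions"]
    return all(scores.get(dim, 0) >= bar for dim, bar in thresholds.items())
```

The value isn't the code itself; it's that "good" is now explicit, per-dimension, and checkable, rather than living in someone's head.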

Without that scaffolding, you're not evaluating a system — you're taking a gamble. And in enterprise contexts, that's a fast track to failure, even when there's a golden application sitting in there somewhere.

Vague Requirements + Ambiguous Data = Garbage In, Garbage Out

One of the first "wow" use cases we stepped into quickly turned from end-to-end system automation to "hmm, what can we do to simply save 5 minutes of time?"

Why such a drastic shift in goals?

It turns out the data source we were using as model input, while large, was perfectly manageable for the model or a simple RAG pipeline. The real problem was the considerable human knowledge embedded in it. The data contained not just a significant number of acronyms but lots of "I know why this is in the data — it's because we used to use X system but that system has since been moved to a new one" sort of context.

In short, we expected the model — which I often tell people to think of as "the smartest brand-new intern you've ever had" — to understand proprietary tribal knowledge our end users had built up over a long time.

We've tested prototype systems where the root cause of failure wasn't the model. In fact, we're quickly finding the models do exactly as they're instructed. The real culprit is input ambiguity.

Some examples we've encountered:

  • Users totally unaware of how to write effective, rock-solid prompts
  • Users feeding mass unstructured data into the models where human knowledge and context is embedded in the data itself
  • Wanting to replace entire end-to-end processes where, again, humans already have years of proprietary company experience

In these scenarios, the GenAI system ends up doing too much guesswork. It can't anchor itself to the specific use-case domain. It can't just "get" all the years of work you've built up as a human working on specific human problems.

It's like asking someone to "explain the news, but make it friendly and useful." What news source? Who's the audience? What does "useful" even mean?

If you want your AI to behave like a responsible and capable system, you have to give it structured, domain-aware context. This means:

  • Curated and filtered source data
  • Clear instructions
  • Tight use-case boundaries

Your instructions should include what not to do, not just what to do. These negative constraints can be refined in your prompts as you test and iterate.
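As a sketch, a prompt template with explicit do/don't instructions and tight use-case boundaries might look like the following. The triage scenario, wording, and placeholder are all hypothetical:

```python
# An illustrative prompt template with explicit positive and negative
# constraints. The scenario and instructions are assumptions, not a
# real production system prompt.

PROMPT_TEMPLATE = """You are a triage assistant for internal QE tickets.

Do:
- Summarise the ticket in at most three sentences.
- Use only the information in the ticket text below.

Do NOT:
- Invent ticket numbers, system names, or owners.
- Offer advice outside the QE triage domain.

Ticket:
{ticket_text}
"""

def build_prompt(ticket_text: str) -> str:
    """Fill the template with curated, pre-filtered ticket text."""
    return PROMPT_TEMPLATE.format(ticket_text=ticket_text)
```

The "Do NOT" block is where most of our iteration tends to happen: each surprising output during testing usually earns a new negative constraint.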

LLM Testing Is Looking Like Its Own Discipline

Testing GenAI outputs is not, at least from our perspective, a simple QE extension. It's potentially a new skillset altogether.

You need to:

  • Design multi-dimensional evaluation suites (e.g. accuracy, compliance, success/failure rates, acceptable benchmark ranges for releases)
  • Decide whether you're using reference-based (e.g. benchmark tooling) or reference-free evaluations (e.g. Subject Matter Expert ratings, LLM-as-a-Judge)
  • Build evaluation pipelines that track results not just pre-release but over time — including trigger alerts for output drift from model updates (which could arguably be considered the role of a Data Scientist; this is where roles get fuzzy)
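The third point above, tracking results over time and triggering alerts on drift, can be sketched with a simple rolling-average monitor. The baseline, tolerance, and window size are arbitrary illustrations:

```python
# A sketch of post-release score tracking: keep a rolling window of
# evaluation scores and flag when the average drifts below an agreed
# benchmark range. All thresholds here are illustrative assumptions.

from collections import deque

class DriftMonitor:
    def __init__(self, baseline: float, tolerance: float, window: int = 20):
        self.baseline = baseline      # agreed benchmark (e.g. pass rate)
        self.tolerance = tolerance    # acceptable dip before alerting
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> bool:
        """Record a new evaluation score; return True if an alert should fire."""
        self.scores.append(score)
        rolling = sum(self.scores) / len(self.scores)
        return rolling < self.baseline - self.tolerance
```

In practice you'd feed this from your evaluation pipeline after each model update or release, which is exactly where the QE and Data Scientist roles start to blur.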

In short, you're not just testing the AI system — you must also test the human and data assumptions around it.

That's why we've found (and this sounds obvious, but you'd be surprised) that every GenAI initiative needs its own test plan, evaluation criteria, acceptance criteria, and — most importantly — human feedback as early as possible in the process, which you can use to refine your test cases.

If you don't know what those are yet, or you don't have SMEs you can rely on, that's your starting point. Assemble that before you start building.

LLM-as-a-Judge: Let the AI Test for You

In one of the first systems we tried to test, we quickly realised that testing responses at scale was the only practical way (at least until we find a better one) to build the benchmark that lets us tell our users: "Our application gets your use case correct 80% of the time on average."

How do you approach such a problem? How many cases do you test? More importantly — how do you even review large-scale model outputs?

This is where LLM-as-a-Judge comes in.

At its heart, it's essentially using an LLM to test and evaluate the output of another LLM. The concept is simple: you give a powerful model (though not always the most powerful — sometimes fast and cheap models work well) a clear set of criteria to grade the output of your application. You don't just ask "is this good?" You give it a job to do with formal scoring metrics.

Luckily we have some very smart people who have built in-house LLM-as-a-Judge packages for us to use, but here's how I would approach it:

  1. Create detailed scoring and success metrics. Your judge prompt needs to be even more robust than your application prompt. It should ask the model to evaluate specific criteria. For example: "On a scale of 1 to 5, score the truthfulness of the output against X criteria. Explain your reasoning."

  2. Provide the full context. The judge model needs all the evidence. For each test, provide:

    • The base prompt
    • The generated output from your application
    • (Optionally) A "golden answer" or reference text if one exists
  3. Structured outputs. If you want to automate AI testing at scale, have the judge model return results in a structured format like JSON. Most modern models are great at this. It gives you machine-readable data you can log, automate, aggregate, interrogate, and use to trigger alerts. Example:

{
  "truthfulness_score": 5,
  "reasoning": "The output correctly identified all entities from the source document and did not introduce new information.",
  "is_safe": true,
  "failed_criteria": null,
  "overall_verdict": "Pass"
}
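Once the judge returns JSON like the above, consuming it at scale is straightforward. Here's an illustrative sketch that validates each verdict against the agreed schema and aggregates a pass rate; the field names mirror the example above, and the input strings stand in for whatever your judge model actually returns:

```python
# A sketch of consuming structured judge verdicts at scale: parse each
# JSON verdict, fail fast if the judge drifted from the agreed schema,
# then aggregate a pass rate across the test suite.

import json

def parse_verdict(raw: str) -> dict:
    verdict = json.loads(raw)
    # Guard against the judge silently changing its output shape.
    for key in ("truthfulness_score", "overall_verdict"):
        if key not in verdict:
            raise ValueError(f"judge output missing {key!r}")
    return verdict

def pass_rate(raw_verdicts: list) -> float:
    """Fraction of judged outputs with an overall verdict of 'Pass'."""
    verdicts = [parse_verdict(r) for r in raw_verdicts]
    passed = sum(v["overall_verdict"] == "Pass" for v in verdicts)
    return passed / len(verdicts)
```

That pass rate is exactly the number behind a claim like "correct 80% of the time on average", and it's machine-readable enough to log, chart, and alert on.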

Caveats

  • Using AI as a judge is never a replacement for human experts — but it can be a powerful aid.
  • Collaborate with humans. Your SMEs should ideally create a golden dataset of perfect responses (think of a knowledge or process management system). Then use them to spot-check the AI judge's scores to make sure it's working as intended. The end goal is to get the AI judge to a point where it reliably mimics your human experts.
  • Bias. Each model has its own bias baked into how it was trained. It's very easy to accidentally choose a model that's far too agreeable. Ideally you thoroughly test the judge itself as you would test an application — though don't go too far down that rabbit hole. When does the testing line end?
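The golden-dataset spot-check in the second caveat can be sketched as a simple agreement measure: how often does the judge land within a point of the SME's score on the same items? The score lists and tolerance below are illustrative assumptions:

```python
# A sketch of spot-checking an LLM judge against SME ratings on a
# golden dataset. Scores are assumed to be on the same 1-5 scale;
# the tolerance of one point is an arbitrary illustration.

def judge_agreement(sme_scores: list, judge_scores: list, tolerance: int = 1) -> float:
    """Fraction of items where the judge is within `tolerance` of the SME."""
    assert len(sme_scores) == len(judge_scores), "scores must be paired per item"
    hits = sum(abs(s - j) <= tolerance for s, j in zip(sme_scores, judge_scores))
    return hits / len(sme_scores)
```

If agreement is low, you iterate on the judge prompt (or swap the judge model) before trusting it at scale; if it's high, the judge is starting to reliably mimic your human experts.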

By using LLM-as-a-Judge you are able to run thousands of automated evaluations, track quality over time, and get detailed feedback at scale. It's a very powerful tool we've added to our GenAI testing toolkit.

GenAI Testing Is Not a One-Time Event

I originally thought of testing as a pre-release activity. We're quickly finding that doesn't work here.

GenAI systems need:

  • Ongoing human-in-the-loop review, especially during the build process
  • Post-deployment monitoring to catch model drift, unexpected behaviours (which you can add to your regression suite), and edge cases surfaced by real users

A philosophy we're adopting is that every deployment is essentially a beta. When you push your system to live, it's no longer totally clear what exact outputs you will get. With close relationships with your user SMEs, you can iteratively shape your system towards an acceptable level of performance over time.

Every success could also be treated as a prototype. As the models driving these new types of applications evolve — via model updates or brand-new model releases — our systems themselves also evolve. Each system you build can theoretically improve or regress with subtle changes. A constant prototype phase of testing new models, new prompts, and updated data sources becomes the norm.

All experience from building, testing, and maintaining (which now go hand in hand) can be snowballed into your next project, piggybacking on whatever new capabilities new model releases bring.

Final Thoughts

This space is evolving fast. Very fast. No one has all the answers — especially me. But we are finding there are fundamentals beginning to embed themselves in our operating model.

If you're starting a GenAI initiative inside your company, ask yourself:

  • How will you define what "good" looks like?
  • Outside of traditional system testing, what do you want to test about the model's output?
  • Are you prepared to evolve that definition and maintain testing — not just perform it — throughout the system build and post-release?

In my experience, clear thinking at the start is the difference between a GenAI tool that adds real value and one that's just a solution looking for a problem.