Question 1

Isn't this just writing some eval scripts? Why pay for it?

Accepted Answer

The CI plumbing is the commodity part. We ship it, but it's not what you're buying. The hard, valuable work is curation and calibration: deciding which production failures belong in the golden set, writing graders that agree with a human reviewer, and tuning thresholds so the gate catches real regressions without crying wolf on every release. That judgment is the deliverable.
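To make "graders that agree with a human reviewer" concrete, here is a minimal sketch of one way to measure that agreement: Cohen's kappa, which corrects raw agreement for the agreement you would expect by chance. The function name and the sample labels are illustrative assumptions, not part of the tool or of any client data.

```python
from collections import Counter

def cohen_kappa(human, grader):
    """Cohen's kappa for two equal-length lists of labels."""
    if len(human) != len(grader) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Fraction of cases where the grader and the reviewer gave the same verdict.
    observed = sum(h == g for h, g in zip(human, grader)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    h_counts, g_counts = Counter(human), Counter(grader)
    expected = sum(h_counts[k] * g_counts[k] for k in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up verdicts for eight golden-set cases (illustration only).
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
grader = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
kappa = cohen_kappa(human, grader)
```

A low kappa on a sample like this would suggest the grader's rubric or threshold needs another calibration pass before it is trusted in a gate.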

Question 2

What exactly do you grade?

Accepted Answer

The tool ships six grader types: grounding (is the answer supported by the retrieved context), hallucination, relevance, tool-call correctness, output format/schema, and a rubric-based LLM judge for open-ended quality. We wire the subset your product actually needs: a support bot leans on grounding, hallucination, and relevance; an agent leans on tool-call correctness and format.
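As a rough illustration of the simplest of these, an output format/schema grader can be sketched as a deterministic check that the model's output parses as JSON and carries the expected fields. The function name, result shape, and field names below are hypothetical and are not the tool's actual interface.

```python
import json

def grade_format(output: str, required: dict) -> dict:
    """Check that output is a JSON object with the required fields and types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return {"passed": False, "reason": f"not valid JSON: {exc.msg}"}
    if not isinstance(data, dict):
        return {"passed": False, "reason": "top-level value is not an object"}
    for key, expected_type in required.items():
        if key not in data:
            return {"passed": False, "reason": f"missing field: {key}"}
        if not isinstance(data[key], expected_type):
            return {"passed": False, "reason": f"wrong type for field: {key}"}
    return {"passed": True, "reason": "ok"}

# Hypothetical schema for an agent's tool call.
schema = {"tool": str, "arguments": dict}
good = grade_format('{"tool": "refund", "arguments": {"order_id": "A1"}}', schema)
bad = grade_format('{"tool": "refund"}', schema)
```

Graders such as grounding or the rubric-based judge need a model or a reviewer in the loop; a format check like this one is cheap and fully repeatable, which is why it suits agent pipelines.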

Question 3

How do I know the gate actually catches regressions?

Accepted Answer

Because we prove it on your own data before handover. The reference demo shows the pattern: on a grounded support bot the suite passes 100%. We then introduce a silent, plausible-looking rewrite of the bot, and the suite fails with a non-zero exit code: the pass rate collapses from 100% to 17%, and the report names which graders regressed. A gate that never fails is theatre. We show you it bites.
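The "naming which graders regressed" step can be sketched as a comparison of per-grader pass rates between a baseline run and a candidate run. This is a minimal sketch under assumed inputs: the tuple format, function names, and sample results are invented for illustration and are not the demo's actual data or the tool's report format.

```python
def pass_rates(results):
    """results: list of (grader_name, passed) tuples -> {grader: pass rate}."""
    totals, passes = {}, {}
    for grader, passed in results:
        totals[grader] = totals.get(grader, 0) + 1
        passes[grader] = passes.get(grader, 0) + int(passed)
    return {g: passes[g] / totals[g] for g in totals}

def regressed_graders(baseline, candidate, tolerance=0.0):
    """Names of graders whose pass rate dropped by more than tolerance."""
    base, cand = pass_rates(baseline), pass_rates(candidate)
    # A grader missing from the candidate run counts as a pass rate of 0.
    return sorted(g for g in base if cand.get(g, 0.0) < base[g] - tolerance)

# Made-up runs: the candidate breaks grounding and hallucination checks.
baseline = [("grounding", True), ("grounding", True),
            ("hallucination", True), ("relevance", True)]
candidate = [("grounding", False), ("grounding", True),
             ("hallucination", False), ("relevance", True)]
regressed = regressed_graders(baseline, candidate)
```

Reporting by grader rather than as a single number is what turns a red build into something a developer can act on.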

Question 4

Where does it run? Do I get locked into a portal?

Accepted Answer

No portal. It's the open-source llm-eval-ci tool running in your CI (GitHub Actions, GitLab, or wherever your PRs live). It exits non-zero when quality drops below threshold, so a bad release is blocked like a failing unit test. You own the golden set, the graders, and the config. We can stop invoicing and nothing breaks.
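The gating behaviour itself is simple enough to sketch: compute the pass rate, compare it to a threshold, and return an exit code the CI runner understands. The function name, arguments, and numbers here are assumptions for illustration, not the tool's real command-line interface.

```python
def gate(passed: int, total: int, threshold: float) -> int:
    """Return a process exit code: 0 if the pass rate meets the threshold, else 1."""
    if total <= 0:
        return 1  # no results is treated as a failure, not a silent pass
    rate = passed / total
    print(f"pass rate {rate:.0%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1

# Hypothetical run: 9 of 10 cases pass against a 95% threshold.
exit_code = gate(passed=9, total=10, threshold=0.95)
# In a CI step, finish with sys.exit(exit_code) so the job fails when it is 1.
```

Because CI runners such as GitHub Actions and GitLab CI treat any non-zero exit code as a failed step, nothing beyond this convention is needed to block a merge.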

Question 5

How long does a sprint take?

Accepted Answer

A scoped LLM Eval Sprint is typically 1–2 weeks: a week to mine real failures into a golden set and calibrate graders against your reviewer, a few days to wire the CI gate and tune thresholds, then handoff. The exact window depends on how much labelled failure data exists and how many graders your product needs.

Stop your LLM quality from silently degrading.

You can't see quality regress until a customer does.

A gate that actually bites.

From failures to a gate in 1–2 weeks.

Common questions.

Tell us where your LLM quality could slip.

Make a bad LLM release fail like a bad build.