You can't see quality regress until a customer does.
You change a prompt, swap a model version, or tweak retrieval, and the unit tests stay green because the code still runs. But the answers got worse, and nobody notices until a user complains or a metric quietly drops. The tests you already have check that the code executes, not that the output is any good.
The fix isn't a generic benchmark that has nothing to do with your product. It's a regression set built from the failures you actually hit, graders that agree with how you judge a good answer, and a gate in CI that blocks the release when quality slips — the same way a failing test blocks a bad deploy.
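To make those three pieces concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Case` type, the substring-based `grade` function, and the `run_gate` helper are hypothetical names, and a real grader would likely be richer (a rubric, a model-based judge, or human-labelled comparisons). The shape is the point: cases drawn from real failures, a grader that encodes what "good" means, and a pass/fail result that CI can act on.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """One regression case, distilled from a failure seen in production."""
    prompt: str
    must_contain: List[str]  # phrases a good answer has to include


def grade(answer: str, case: Case) -> bool:
    """Hypothetical grader: pass only if every required phrase appears."""
    text = answer.lower()
    return all(phrase.lower() in text for phrase in case.must_contain)


def pass_rate(generate: Callable[[str], str], cases: List[Case]) -> float:
    """Run every case through the model and return the fraction that pass."""
    if not cases:
        raise ValueError("regression set is empty")
    passed = sum(1 for case in cases if grade(generate(case.prompt), case))
    return passed / len(cases)


def run_gate(generate: Callable[[str], str], cases: List[Case],
             baseline: float) -> int:
    """Return a process exit code: 0 if quality holds, 1 if it slipped."""
    rate = pass_rate(generate, cases)
    print(f"pass rate {rate:.2f} (baseline {baseline:.2f})")
    return 0 if rate >= baseline else 1
```

In a CI job, a script would call something like `sys.exit(run_gate(my_model, cases, baseline=0.95))`, where `my_model` wraps whatever prompt, model version, and retrieval setup is about to ship. A non-zero exit code fails the job, so the release is blocked the same way a failing unit test would block it. The `0.95` baseline is a placeholder, not a recommendation.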