Prompt engineering is software engineering without the version-control discipline. Most teams build prompts the way people built code in 1995 — pasting into a text file and hoping. Here's what real discipline looks like.

Every team that ships LLM features eventually runs into the same wall. Production loads a system prompt named v3-final.txt. Then someone tweaks it for an edge case and saves it as v3-final-FIXED.txt. Then the new version regresses on a different case, so someone restores most of v3 but keeps one phrase from FIXED and saves the result as v3.1-actually-final.txt. Three months later, nobody can reproduce the eval from launch week. The numbers in the original launch blog post can't be verified. Nobody is sure whether the prompt running in production is even the right one.

This is the state of prompt engineering at most teams as of 2026. It's the state JavaScript was in around 1998 — production-critical code being shipped via file-naming conventions and tribal memory. The discipline that turned web development from artisanal hackery into engineering practice is the same discipline prompt engineering needs now: version control, paired evaluations, and rigorous diff hygiene.

This article lays out the minimum-viable version of that discipline. None of it is novel: it is standard software-engineering practice applied to a substrate (natural-language prompts) that most teams never thought to apply it to.

The problem in concrete terms

The damage from prompt-versioning sloppiness shows up in three specific ways:

All three failure modes are completely standard for code-with-no-version-control. The unusual thing about prompts is that most teams accept this state as normal.

Baseline #1: Git the prompts

This is the floor. If the prompts aren't in version control, nothing else in this article is reachable.

The right place is the application repo, next to the code that uses the prompt. Not a separate "prompts" repo (drifts from the code), not a Notion page (no diff history), not a Google Doc (likewise), not a database row (no commit messages). A plain text or YAML file in the codebase that the application loads at runtime, with the file under Git like any other source code.
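A minimal sketch of the runtime-loading side, assuming one plain-text file per prompt. The function name `load_prompt`, the `.txt` extension, and the directory argument are illustrative choices, not something the article prescribes:

```python
from pathlib import Path


def load_prompt(name: str, prompt_dir: Path) -> str:
    """Read a prompt file that is tracked in Git alongside the application code."""
    # Hypothetical convention: one file per prompt, named <name>.txt.
    path = prompt_dir / f"{name}.txt"
    if not path.is_file():
        raise FileNotFoundError(f"No prompt file at {path}")
    # Strip trailing whitespace so an editor's final newline doesn't alter the prompt.
    return path.read_text(encoding="utf-8").rstrip()
```

Because the file is ordinary source, every change to it arrives as a commit with an author, a message, and a diff.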

Specific structural choices that pay off:

Baseline #2: Eval-coupled prompts

Storing prompts under Git makes diffs visible. Coupling them to evals makes diffs meaningful.

The pattern: every prompt has a paired eval suite, kept in Git in the same repo. CI runs the suite on every change to the prompt, and the results (pass rate, latency, token cost) are recorded in a tracked file or evaluation log that is committed alongside the prompt change.
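One way this could look in practice is sketched below. It assumes eval cases are dicts with `input` and `expected` keys, that scoring is exact match, and that `model` is any callable taking a prompt and an input; all of these are simplifying assumptions, and token cost is left out because it depends on the provider's tokeniser.

```python
import json
import time
from typing import Callable


def run_eval(prompt: str, cases: list[dict], model: Callable[[str, str], str]) -> dict:
    """Run every case through the model and summarise pass rate and latency."""
    passed = 0
    latencies = []
    for case in cases:
        start = time.perf_counter()
        output = model(prompt, case["input"])
        latencies.append(time.perf_counter() - start)
        # Exact match is the simplest possible check; real suites often need more.
        if output.strip() == case["expected"]:
            passed += 1
    return {
        "cases": len(cases),
        "pass_rate": passed / len(cases) if cases else 0.0,
        "mean_latency_s": sum(latencies) / len(latencies) if latencies else 0.0,
    }


def record_result(result: dict, prompt_version: str, log_path: str) -> None:
    """Append one JSON line per eval run to a log file committed with the prompt."""
    entry = {"prompt_version": prompt_version, **result}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```

An append-only JSON-lines log keeps each run as a single added line, so the eval history shows up cleanly in the same diff as the prompt change.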

What this gets you:

The friction with this discipline is that LLM evals are slow and cost money. A 500-case eval suite that hits a frontier API can cost €5-€20 and take 10-30 minutes. Teams sometimes try to skip running it on small changes. The skipping is exactly where regressions hide.

The mitigation: a small "smoke" suite (20-30 cases) that runs on every prompt change, and the full suite that runs on merge to main or on a daily cron. The smoke suite catches gross regressions in 30 seconds. The full suite catches subtle ones overnight.
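If the smoke suite is carved out of the full suite rather than hand-curated, the subset should be stable from run to run, otherwise pass rates are not comparable between commits. A sketch of one way to do that, assuming each case has a unique string ID (the function name and default size are illustrative):

```python
import hashlib


def smoke_subset(case_ids: list[str], size: int = 25) -> list[str]:
    """Pick a stable subset by ranking case IDs on a hash of the ID."""
    ranked = sorted(
        case_ids,
        key=lambda cid: hashlib.sha256(cid.encode("utf-8")).hexdigest(),
    )
    # The same IDs always hash the same way, so the subset does not depend on input order.
    return ranked[:size]
```

A hand-picked smoke suite that covers known failure modes may well be better; the hash approach only guarantees repeatability.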

Baseline #3: Diff discipline

Once prompts are in Git, you get prompt diffs. Reading them is its own skill.

The dangerous pattern: a team member changes a prompt, the diff shows "+47 -23 lines," and the reviewer skims the diff, sees the new lines look reasonable, and approves. Three weeks later the regression surfaces and review reveals that one of the removed lines was load-bearing in a way nobody noticed.
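One cheap guard against this is to pull the removed lines out of the diff and review them on their own, so they cannot hide among the additions. A minimal sketch using Python's standard `difflib`; the helper name `removed_lines` is an illustrative choice:

```python
import difflib


def removed_lines(old_prompt: str, new_prompt: str) -> list[str]:
    """Return lines of the old prompt that were removed or changed in the new one."""
    diff = difflib.ndiff(old_prompt.splitlines(), new_prompt.splitlines())
    # ndiff marks lines unique to the old text with "- "; "? " hint lines are skipped.
    return [line[2:] for line in diff if line.startswith("- ")]
```

A reviewer can then ask, for each returned line, what behaviour it was responsible for and which eval case would notice its absence.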

Better diff discipline looks like:

Baseline #4: Cost-of-change tracking

Beyond accuracy, prompts have two more dimensions worth tracking across versions:

  1. Token count: maps directly to API cost per call. Doubling the prompt length roughly doubles the input-token cost of each call, which adds up at scale.
  2. Latency: longer prompts take longer to process. For interactive products this often matters more than the cost.

Maintain a spreadsheet (literally a spreadsheet, or its CSV equivalent in the repo) of prompt versions × measurement axes:

Version  Eval pass rate  Tokens  Latency (p50)  Notes
v1.0     78%             120     1.2s           baseline
v1.1     85%             180     1.5s           added 3 few-shot examples
v1.2     86%             220     1.6s           +1 example; diminishing returns
v1.3     90%             150     1.3s           restructured task definition, removed 2 examples
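Kept as CSV in the repo, the same data can be compared version to version with a few lines of code. The sketch below embeds the table's figures directly so it is self-contained; the column names and the `version_deltas` helper are illustrative assumptions.

```python
import csv
import io

# The table's figures as CSV; in practice this would be a tracked file in the repo.
METRICS_CSV = """version,pass_rate_pct,tokens,latency_p50_s,notes
v1.0,78,120,1.2,baseline
v1.1,85,180,1.5,added 3 few-shot examples
v1.2,86,220,1.6,+1 example; diminishing returns
v1.3,90,150,1.3,"restructured task definition, removed 2 examples"
"""


def version_deltas(csv_text: str) -> list[dict]:
    """Compare each version with the one immediately before it."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    deltas = []
    for prev, cur in zip(rows, rows[1:]):
        deltas.append({
            "version": cur["version"],
            "pass_rate_delta": int(cur["pass_rate_pct"]) - int(prev["pass_rate_pct"]),
            "token_delta": int(cur["tokens"]) - int(prev["tokens"]),
        })
    return deltas
```

Run on the data above, the v1.2 entry shows +1 point for +40 tokens, while v1.3 shows +4 points for 70 fewer tokens.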

The v1.3 row is the punchline of this article and the subject of the next section: it scores 4 points above v1.2 while using 70 fewer tokens.

The 80% rule

The most consistent observation across teams that have done serious prompt-engineering work for a year or more: 80% of the meaningful improvement comes from clearer task definition, not from clever wording.

The clever-wording trap looks like this: a prompt starts with a clear-but-imperfect task description. Iteration adds qualifiers ("be very precise," "make sure to consider both X and Y"), few-shot examples, edge-case handlers, format specifications. The prompt grows to 400 tokens. Performance creeps up.

Then someone with fresh eyes looks at the original task description and notices it never actually said what the output format should be in unambiguous terms. They rewrite the first paragraph to specify the output crisply, delete most of the qualifiers as redundant, and remove the few-shot examples that were compensating for the vague task definition. The prompt drops to 150 tokens. Performance goes up 4 percentage points.

This pattern repeats across LLM teams often enough to be a reliable heuristic. The implication for iteration order:

  1. Start by getting the task definition crystal clear. Write the prompt as if you were briefing a smart junior employee with no domain context. What outputs are valid? What outputs are not? What edge cases should be handled and how?
  2. Test with the simplest possible prompt. See how the model does without any examples or qualifiers. Often it does well enough that further iteration is barely needed.
  3. Add few-shot examples only where the model demonstrably fails. One example per failure mode, drawn from the eval suite.
  4. Save the wordsmithing for absolute last. Most of the "tone" tweaks are theatre once the task definition is right.
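Step 3 can be made mechanical once eval failures have been triaged. The sketch below assumes each failing case carries a `failure_mode` label assigned during triage; that field and the function name are hypothetical.

```python
def pick_few_shot_examples(failures: list[dict]) -> list[dict]:
    """Keep the first failing eval case seen for each failure mode."""
    chosen: dict[str, dict] = {}
    for case in failures:
        # "failure_mode" is a hypothetical label attached when triaging eval failures.
        chosen.setdefault(case["failure_mode"], case)
    return list(chosen.values())
```

Capping the selection at one example per failure mode keeps the prompt from accumulating near-duplicate examples that each cost tokens.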

Teams that follow this order end up with shorter, cheaper, more maintainable prompts than teams that start with a long prompt and try to refine it.

Tool walkthrough

The prompt diff tool gives you the token-aware diff view between two prompt versions — paste old and new, see what actually changed beyond what a line-based diff would show. Useful for code review of prompt PRs.

The token counter shows the token count for any prompt across major model tokenisers (cl100k for GPT-4, the Anthropic tokeniser for Claude, the Llama tokeniser for open-source models). Pasting a prompt before and after a change gives you the cost-delta number that belongs in your review.

The system prompt linter catches common footguns: conflicting instructions, undefined output formats, references to model capabilities that don't exist, ambiguous role specifications. Not exhaustive but catches the obvious mistakes that show up in production-deployed prompts.
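To give a feel for what such checks involve, here is a toy sketch of two of them. It is not the linter's actual logic; the keyword list and the always/never heuristic are crude assumptions chosen for illustration.

```python
import re


def lint_prompt(prompt: str) -> list[str]:
    """Return warnings from two crude heuristics (illustrative only)."""
    warnings = []
    lowered = prompt.lower()
    # Heuristic 1: the prompt never mentions an output format at all.
    if not re.search(r"\b(format|json|yaml|xml|markdown|respond with|output)\b", lowered):
        warnings.append("no output format specified")
    # Heuristic 2: the same phrase follows both "always" and "never".
    # Only phrases ending in punctuation or a newline are captured.
    always = set(re.findall(r"\balways ([a-z ]+?)[.,;\n]", lowered))
    never = set(re.findall(r"\bnever ([a-z ]+?)[.,;\n]", lowered))
    for phrase in sorted(always & never):
        warnings.append(f"conflicting instructions about: {phrase}")
    return warnings
```

For example, `lint_prompt("Always use bullet points. Never use bullet points.\n")` flags both the missing output format and the contradiction, while `lint_prompt("Respond with JSON. Always cite sources.")` returns an empty list.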

Things this article isn't saying

Two important caveats:

Where to read further

Prompt versioning is one of those disciplines where the cost of starting late is high and the cost of starting early is nearly zero. Set up Git for your prompts before you have a regression to chase. Build the eval suite while it's still small. Six months later, the teams that do this look like they've been doing it for years; the teams that don't are still debugging which version of which prompt produced which result.
