How diff works: the algorithm behind every code review

There are infinitely many ways to describe the difference between two files. "Delete every line, then add every line of the new file" is technically a valid diff — it's just a useless one. What makes diff diff is that it looks for a minimal description: the fewest changed lines that turn old into new. That single design decision produces everything you recognize about a diff, good and weird alike.

The problem: fewest edits, not any edits.

First, a framing choice that matters: classic diff compares lines, not characters. Each file becomes a sequence of lines, and the algorithm asks how one sequence becomes the other using two operations — delete a line, insert a line. Lines are a sweet spot: they're usually the unit a human edited, and comparing a few thousand lines is enormously cheaper than comparing a million characters.

Then the goal: among all edit scripts that transform old into new, find a shortest one. Fewest deletes plus inserts. That's the entire specification of diff — everything else is how to find it efficiently and how to print it.

The longest common subsequence.

Minimizing edits has a mirror image: maximizing what's kept. The lines a diff leaves untouched are lines appearing in both files in the same order — a common subsequence (not necessarily contiguous, unlike a substring). Finding the fewest edits is exactly equivalent to finding the longest common subsequence: keep the LCS, delete every old line outside it, insert every new line outside it.

old:  A  B  C  D  E
new:  A  C  B  E  F

one LCS:  A  C  E     (also A B E: ties are real)

diff:   A            kept
      - B            deleted   (in old here, but not in this LCS)
        C            kept
      - D            deleted
      + B            added     (new's B, which this LCS skipped)
        E            kept
      + F            added
      (4 edits: 2 deletions + 2 insertions. Pick the other LCS
       and it's C that moves instead.)

Notice the tie: A C E and A B E are both longest. Both produce valid, minimal, different diffs: keeping A C E (as shown) deletes and re-inserts B, keeping A B E would delete and re-insert C, and either way D is deleted and F is added. Diff output is not unique, which is worth remembering next time two tools disagree about the "same" change.
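The equivalence between "longest common subsequence" and "fewest edits" can be sketched with the textbook table-filling approach. This is a minimal illustration, not what production diff tools run; the function names `lcs_table` and `edit_script` are made up for this example, and the tie-break rule is an arbitrary choice.

```python
def lcs_table(old, new):
    """t[i][j] = length of the LCS of old[i:] and new[j:]."""
    n, m = len(old), len(new)
    t = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                t[i][j] = t[i + 1][j + 1] + 1
            else:
                t[i][j] = max(t[i + 1][j], t[i][j + 1])
    return t


def edit_script(old, new):
    """Return a list of (marker, line) pairs: ' ' kept, '-' deleted, '+' inserted."""
    t = lcs_table(old, new)
    i = j = 0
    script = []
    while i < len(old) and j < len(new):
        if old[i] == new[j]:
            script.append((" ", old[i]))
            i += 1
            j += 1
        elif t[i + 1][j] >= t[i][j + 1]:
            # Tie-break: prefer deleting. Using '>' here picks a different LCS.
            script.append(("-", old[i]))
            i += 1
        else:
            script.append(("+", new[j]))
            j += 1
    # Whatever is left on either side is pure deletion or pure insertion.
    script += [("-", line) for line in old[i:]]
    script += [("+", line) for line in new[j:]]
    return script


old = list("ABCDE")
new = list("ACBEF")
script = edit_script(old, new)
lcs_length = lcs_table(old, new)[0][0]
edit_count = sum(1 for marker, _ in script if marker != " ")
for marker, line in script:
    print(marker, line)
```

On the A B C D E example this keeps A C E and reports four edits, which matches the identity edits = len(old) + len(new) - 2 × (LCS length) = 5 + 5 - 6.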

Myers' algorithm, the short version.

The textbook LCS solution is dynamic programming over an N×M table — fine in theory, wasteful for the typical case where two versions of a file are mostly identical. The algorithm git and GNU diff actually use comes from Eugene Myers' 1986 paper "An O(ND) Difference Algorithm and Its Variations."

Its key idea: think of diffing as walking a grid, with the old file along the horizontal axis and the new file along the vertical axis, from the top-left (start of both files) to the bottom-right (end of both). A step right is a deletion, a step down is an insertion, and wherever the current old line equals the current new line, you slide diagonally for free. A shortest edit script is a path with the fewest non-diagonal steps. Myers searches by asking "how far can I get using 0 edits? 1 edit? 2?", expanding a frontier greedily along those free diagonals. Its cost is written O(ND), where N is the combined length of the two files and D is the number of edits in the shortest script: near-linear when files are similar, which is almost always. When two versions differ by three lines, diff doesn't do quadratic work to prove it.

The one-sentence version: diff is a shortest-path search through a grid where matching lines are free moves — and real files are mostly free moves.
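The frontier search can be sketched in a few lines. This is a simplified version of the greedy forward search from Myers' paper that returns only the number of edits D, not the script itself; real tools add refinements (linear-space splitting, heuristics for huge inputs) that are left out here. The function name `myers_distance` is an assumption for illustration.

```python
def myers_distance(old, new):
    """Length of a shortest edit script (deletions + insertions) from old to new."""
    n, m = len(old), len(new)
    # v[k] = furthest x reached so far on diagonal k, where k = x - y.
    # The seed v[1] = 0 lets the d = 0 round start at (0, 0).
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]          # arrive by a step down (insertion)
            else:
                x = v[k - 1] + 1      # arrive by a step right (deletion)
            y = x - k
            # Slide along the diagonal while lines match: these moves are free.
            while x < n and y < m and old[x] == new[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m  # not reached: the loop always returns by d = n + m


print(myers_distance(list("ABCDE"), list("ACBEF")))
```

The outer loop runs D + 1 times and each round touches at most D + 1 diagonals, so apart from the free diagonal slides the work depends on the number of differences rather than on the full grid.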

Reading unified diff format.

The near-universal output format is the unified diff, and its densest line is the hunk header:

--- a/config.js          the old file
+++ b/config.js          the new file
@@ -14,7 +14,8 @@ function setup()
 unchanged context line
 unchanged context line
 unchanged context line
-removed line
+added line
+another added line
 unchanged context line
 unchanged context line
 unchanged context line

@@ -14,7 +14,8 @@ reads: this hunk covers 7 lines of the old file starting at line 14 (six context lines plus one removed), and 8 lines of the new file starting at line 14 (six context lines plus two added). The trailing function setup() is a courtesy: the nearest preceding line that looks like a declaration, so a human knows roughly where they are. Around each change sit (by default) three lines of unchanged context, which is what lets patch and git apply locate a hunk even after the file has shifted. Nearby changes whose context would overlap get merged into one hunk.
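To make the header arithmetic concrete, here is a small sketch that parses a hunk header and checks it against the hunk body. The hunk is hypothetical, and `parse_hunk_header` and `body_counts` are names invented for this example. One detail worth knowing: when a count is omitted (as in `@@ -7 +7,2 @@`), it defaults to 1.

```python
import re

# Matches "@@ -start[,count] +start[,count] @@ optional section text".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@ ?(.*)$")


def parse_hunk_header(line):
    match = HUNK_RE.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_count, new_start, new_count, section = match.groups()
    return {
        "old_start": int(old_start),
        "old_count": int(old_count) if old_count is not None else 1,
        "new_start": int(new_start),
        "new_count": int(new_count) if new_count is not None else 1,
        "section": section,
    }


def body_counts(body_lines):
    """Context lines count on both sides; '-' only on old; '+' only on new."""
    old = sum(1 for line in body_lines if line[:1] in (" ", "-"))
    new = sum(1 for line in body_lines if line[:1] in (" ", "+"))
    return old, new


# A hypothetical hunk: two context lines, one removal, two additions, two context lines.
hunk = [
    "@@ -3,5 +3,6 @@ def main():",
    " context",
    " context",
    "-removed line",
    "+added line",
    "+another added line",
    " context",
    " context",
]
header = parse_hunk_header(hunk[0])
counts = body_counts(hunk[1:])
print(header, counts)
```

If the counts computed from the body disagree with the header, the hunk is malformed, which is a common way for hand-edited patches to fail to apply.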

Why diffs sometimes look wrong.

Moved code shows as delete-plus-add. The edit model has no "move" operation, so relocating a function is a deletion here and an insertion there — reviewers see 40 red lines and 40 green ones for a change that changed nothing. (Some tools layer move-detection on top afterward, but the underlying diff is still delete+add.)

Whitespace churn drowns the signal. Re-indent a block and every line in it differs. Line-based comparison is exact, so git diff -w (ignore whitespace) exists precisely to recover the real change.
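The effect of ignoring whitespace can be sketched by changing only the equality test between lines. This toy example compares lines position by position rather than running a real diff, and `normalize` and `lines_equal` are illustrative names, not anything from git.

```python
def normalize(line):
    """Drop every whitespace character, so 'a = 1' and 'a=1' compare equal."""
    return "".join(line.split())


def lines_equal(a, b, ignore_whitespace=False):
    if ignore_whitespace:
        return normalize(a) == normalize(b)
    return a == b


# A block that was re-indented, with one real change on the last line.
old = ["if ready:", "run()", "log('done')"]
new = ["if ready:", "    run()", "    log('finished')"]

exact = [i for i, (a, b) in enumerate(zip(old, new)) if not lines_equal(a, b)]
loose = [i for i, (a, b) in enumerate(zip(old, new))
         if not lines_equal(a, b, ignore_whitespace=True)]
print(exact, loose)
```

With exact comparison both indented lines differ; with whitespace ignored, only the line whose content actually changed is left.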

Minimal isn't always human. With repetitive content — say, adding a new brace-delimited block next to an identical-looking one — several minimal diffs exist, and the algorithm may pick one that slices across your mental block boundaries (the classic misaligned-braces diff). Git's --patience and --histogram modes trade strict minimality for alignment on rare, distinctive lines, and usually read better on code.
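The "rare, distinctive lines" idea can be sketched as an anchoring step: find lines that occur exactly once in each file, then keep the longest run of them that appears in the same order on both sides. This follows the way patience diff is usually described and is only that first step (a full implementation would then diff the gaps between anchors); it is not git's code, and both function names are invented here.

```python
from bisect import bisect_left
from collections import Counter


def unique_common_lines(old, new):
    """Pairs (i, j) with old[i] == new[j], where the line occurs once in each file."""
    old_counts, new_counts = Counter(old), Counter(new)
    new_index = {line: j for j, line in enumerate(new)}
    return [(i, new_index[line]) for i, line in enumerate(old)
            if old_counts[line] == 1 and new_counts[line] == 1]


def longest_increasing_by_j(pairs):
    """Longest subsequence of pairs (already ordered by i) with increasing j."""
    tails = []      # tails[k]: index into pairs ending the best run of length k + 1
    tail_js = []    # the j value of each entry in tails, kept sorted for bisect
    prev = [None] * len(pairs)
    for idx, (_, j) in enumerate(pairs):
        k = bisect_left(tail_js, j)
        prev[idx] = tails[k - 1] if k > 0 else None
        if k == len(tails):
            tails.append(idx)
            tail_js.append(j)
        else:
            tails[k] = idx
            tail_js[k] = j
    run = []
    idx = tails[-1] if tails else None
    while idx is not None:
        run.append(pairs[idx])
        idx = prev[idx]
    return run[::-1]


old = ["void a() {", "  foo();", "}", "", "void c() {", "  baz();", "}"]
new = ["void a() {", "  foo();", "}", "", "void b() {", "  bar();", "}", "",
       "void c() {", "  baz();", "}"]
anchors = longest_increasing_by_j(unique_common_lines(old, new))
print(anchors)
```

The repeated `}` and blank lines never become anchors, so they cannot pull the alignment across a block boundary; the function signatures and bodies, which are unique, decide where the blocks line up.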

One character changed, whole line flagged. Lines are the atom, so a one-character typo fix marks the entire line changed. Tools that highlight the changed characters run a second, character-level diff inside the flagged lines — same algorithm, smaller alphabet.
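A second pass inside a flagged line can be sketched with Python's standard `difflib`. Note the substitution: `difflib.SequenceMatcher` uses its own matching heuristic rather than Myers' algorithm and does not promise a minimal result, so this stands in for the idea of a character-level diff, not for any particular tool's implementation. `changed_spans` is a made-up helper name.

```python
from difflib import SequenceMatcher


def changed_spans(old_line, new_line):
    """Return (tag, old_text, new_text) for each non-matching region of two lines."""
    matcher = SequenceMatcher(None, old_line, new_line, autojunk=False)
    return [(tag, old_line[i1:i2], new_line[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


spans = changed_spans("timeout = 30", "timeout = 60")
print(spans)
```

The line-level diff still reports the whole line as removed and re-added; the spans only tell the renderer which characters to highlight within it.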

Takeaways.

The thing to remember: a diff is the fewest line deletions and insertions between two files, or equivalently, everything outside the longest common subsequence. Myers' algorithm finds it in time proportional to file size times the number of differences, so similar files are cheap. Minimal diffs aren't unique, moves are delete+add, and the @@ header is just "where, and how many lines, on each side."

Diff is quiet infrastructure: forty-year-old algorithmic work you invoke a hundred times a day without thinking. Knowing what it optimizes — and what it structurally can't express — turns confusing diffs from annoyances into things you can explain.

Diff two texts in your browser.

The Diff tool compares two pieces of text side by side and highlights what changed — right in your browser, nothing uploaded. Handy for config files, API responses, and "what did I actually change" moments outside a repo.

Made with love by a very serious person pretending not to be. Tooly McToolface is a workshop of free, client-side web tools. For more algorithm-behind-the-tool reading, why your regex can hang the server covers a search that goes exponentially wrong, and CRC32 vs MD5 vs SHA-256 explains the fingerprints git uses underneath.
