How to diff two files
Unified vs split diff, ignoring whitespace and line endings, word-level vs line-level granularity, and when to use a 3-way merge view.
Diffing two files is one of those operations every developer does a hundred times a day without thinking about what is actually happening. Under the hood, diff computes the shortest edit script between two sequences: the smallest set of insertions and deletions that turns one into the other. That is a harder problem than it sounds, and the choice of algorithm, granularity (line, word, character), and output format (unified, context, side-by-side) each involves trade-offs. The right choice depends on whether you are reading the diff yourself, feeding it to a patch tool, or comparing structured data where line-level diffs miss the point. This guide covers unified and context formats, line, word, and character granularity, three-way merge, the Myers algorithm that most tools use, and when to reach for git diff versus a standalone diff tool.
The edit-script problem
A diff is the answer to: given sequence A and sequence B, what is the shortest sequence of insertions and deletions that turns A into B? Ignoring reorders, the minimum-edit-distance problem has a well-known dynamic-programming solution in O(mn) time and O(mn) space. For two 10,000-line files, that is 100 million table cells, which is too much memory for a routine comparison. The Myers algorithm (Eugene Myers, 1986) solves the same problem in O((m+n)D) time, where D is the size of the edit script, which is usually much smaller than m+n. Git, most GNU diff implementations, and most online diff tools use Myers or a patience variant.
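As a rough illustration of the greedy idea behind Myers, the sketch below computes only D, the length of the shortest edit script, and not the script itself. The function name shortest_edit_distance is made up for this example; real tools add backtracking and several optimizations on top of this loop.

```python
def shortest_edit_distance(a, b):
    """Return D: the minimum number of insertions plus deletions turning a into b."""
    n, m = len(a), len(b)
    # v[k] is the furthest x reached so far on diagonal k, where k = x - y.
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            # Extend whichever neighbouring diagonal has reached further.
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]        # step down: insert an element of b
            else:
                x = v[k - 1] + 1    # step right: delete an element of a
            y = x - k
            # Follow the "snake": matching elements cost nothing.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m


# The worked example from Myers' 1986 paper.
print(shortest_edit_distance("ABCABBA", "CBABAC"))  # 5
```

The outer loop runs D+1 times and each pass touches at most D+1 diagonals, so two nearly identical files cost far less than filling the full m-by-n table.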
Unified diff format
Unified diff is the format you see in pull requests and git diff output. Each hunk starts with a range header like @@ -10,5 +10,6 @@: the hunk covers 5 lines starting at line 10 of the old file and 6 lines starting at line 10 of the new file. Lines prefixed with - are removed, lines prefixed with + are added, and lines prefixed with a single space are unchanged context. The default context is three lines before and after each change, which is enough for a human to orient without drowning in unchanged content.
--- a/config.yml
+++ b/config.yml
@@ -10,5 +10,6 @@
 server:
   host: localhost
-  port: 8080
+  port: 8443
+  tls: true
   timeout: 30
   retries: 3
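If you want to produce this format programmatically, Python's standard difflib module is one option. The sketch below is a minimal example with made-up file contents; because the sample lists start at line 1, the hunk header differs from the one above.

```python
import difflib

# Hypothetical old and new versions of a small config file, one string per line.
old = [
    "server:\n",
    "  host: localhost\n",
    "  port: 8080\n",
    "  timeout: 30\n",
    "  retries: 3\n",
]
new = [
    "server:\n",
    "  host: localhost\n",
    "  port: 8443\n",
    "  tls: true\n",
    "  timeout: 30\n",
    "  retries: 3\n",
]

# n=3 asks for three lines of context around each change (the usual default).
diff_lines = list(
    difflib.unified_diff(old, new, fromfile="a/config.yml", tofile="b/config.yml", n=3)
)
print("".join(diff_lines), end="")
```

The third output line is the range header, here @@ -1,5 +1,6 @@, followed by the context, removed, and added lines.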
Context diff format
Context diff is the older format, still in use for some patch-based workflows. Each hunk shows the old block fully (lines prefixed with ! for changed, - for removed) and the new block fully (with ! for changed, + for added). It is more verbose than unified but easier for some humans to read because old and new are shown as intact blocks instead of interleaved. Most modern tools default to unified.
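For comparison, the same kind of change can be rendered in context format with difflib.context_diff from the Python standard library. This is a small sketch with made-up input.

```python
import difflib

# Hypothetical old and new versions, one string per line.
old = ["server:\n", "  host: localhost\n", "  port: 8080\n", "  timeout: 30\n"]
new = ["server:\n", "  host: localhost\n", "  port: 8443\n", "  tls: true\n", "  timeout: 30\n"]

context_lines = list(
    difflib.context_diff(old, new, fromfile="a/config.yml", tofile="b/config.yml")
)
print("".join(context_lines), end="")
```

The old block appears under a *** 1,4 **** header and the new block under --- 1,5 ----. Because the port line was replaced, difflib marks every line of that replaced region with !, including the newly added tls line.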
Line granularity
Line-level diffs are the default because most code and text-based formats are line-oriented. A diff that says “line 42 was replaced” is usually precise enough for code review. Line-level diffs break down when a single long line changes: you see the whole line as removed and the whole line as added, which obscures the actual edit. Long YAML values, minified JavaScript, and single-line SVG documents all have this problem.
Word and character granularity
Word-level diffs split the content on whitespace and run the diff algorithm over tokens. Character-level goes one step further. Both produce diffs that highlight the exact sub-line change, which is essential for prose and helpful for long-line structured data. Most diff UIs default to line-level and offer a toggle to “show intra-line changes.” Git’s --word-diff flag produces word-level output.
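The token-based approach can be sketched in a few lines of Python using difflib.SequenceMatcher. The function name word_diff and the bracket markers are choices made for this example; the markers imitate the look of Git's plain word-diff output, not its implementation.

```python
import difflib


def word_diff(old, new):
    """Diff two strings token by token, marking removals and additions inline."""
    # split() tokenizes on whitespace, so runs of spaces are not preserved.
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        piece = ""
        if i1 < i2:  # tokens present only in the old text
            piece += "[-" + " ".join(a[i1:i2]) + "-]"
        if j1 < j2:  # tokens present only in the new text
            piece += "{+" + " ".join(b[j1:j2]) + "+}"
        out.append(piece)
    return " ".join(out)


print(word_diff(
    "The quick brown fox jumps over the lazy dog.",
    "The quick red fox leaps over the sleepy dog.",
))
```

Swapping split() for list(old) would give a character-level variant, at the cost of noisier output on prose.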
# Line-level
- The quick brown fox jumps over the lazy dog.
+ The quick red fox leaps over the sleepy dog.
# Word-level
The quick [-brown-]{+red+} fox [-jumps-]{+leaps+} over the [-lazy-]{+sleepy+} dog.
Side-by-side view
Side-by-side diffs show the old content in one column and the new content in the other, with aligned lines and highlighted changes. This is the view most IDE diff tools default to because it matches how humans compare things visually. Unified is more compact and better for narrow terminals or for reading a patch file; side-by-side is better for full-file review. Some tools offer a three-column view that adds a common-ancestor column for merge conflicts.
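A plain-text approximation of the two-column layout can be built from the same matching information. The sketch below is a simplified take: side_by_side is a hypothetical helper, and the markers (| changed, < only in old, > only in new) are borrowed loosely from the convention used by diff -y.

```python
import difflib
from itertools import zip_longest


def side_by_side(old_lines, new_lines, width=30):
    """Return rows pairing old lines (left) with new lines (right)."""
    rows = []
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left, right = old_lines[i1:i2], new_lines[j1:j2]
        # zip_longest pads the shorter side with None so rows stay aligned.
        for l, r in zip_longest(left, right):
            if tag == "equal":
                marker = " "
            elif r is None:
                marker = "<"
            elif l is None:
                marker = ">"
            else:
                marker = "|"
            rows.append(f"{(l or ''):<{width}} {marker} {(r or '')}")
    return rows


old = ["host: localhost", "port: 8080", "timeout: 30"]
new = ["host: localhost", "port: 8443", "tls: true", "timeout: 30"]
print("\n".join(side_by_side(old, new, width=16)))
```

Lines longer than width are not truncated here, which is one reason real side-by-side viewers need wide windows or wrapping.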
Three-way merge
Two-way diff compares A and B. Three-way merge compares A and B against a common ancestor C, which is what git merge does. If both A and B changed a line relative to C but in different ways, that is a conflict: Git marks it with <<<<<<<, =======, and >>>>>>> markers and leaves it to the human to resolve. If A and B changed different lines, the merge proceeds automatically. Three-way merge is why version control works at all; without it, every concurrent edit would require manual resolution.
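The per-line decision rule can be sketched as follows. This is a deliberately simplified illustration: it assumes all three versions have the same number of lines (no insertions or deletions), which real merge tools do not require, and the function name merge3_same_length is made up for the example.

```python
def merge3_same_length(base, ours, theirs):
    """Merge line by line; returns (merged_lines, conflict_count).

    Simplifying assumption: the three versions have equal length, so line i
    in each version refers to the same logical line.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles versions of equal length")
    merged, conflicts = [], 0
    for c, a, b in zip(base, ours, theirs):
        if a == b:        # both sides agree (changed identically or not at all)
            merged.append(a)
        elif a == c:      # only their side changed the line
            merged.append(b)
        elif b == c:      # only our side changed the line
            merged.append(a)
        else:             # both changed it, differently: conflict
            conflicts += 1
            merged += ["<<<<<<< ours", a, "=======", b, ">>>>>>> theirs"]
    return merged, conflicts


base = ["x = 1", "y = 2", "z = 3"]
ours = ["x = 10", "y = 2", "z = 3"]
theirs = ["x = 1", "y = 2", "z = 30"]
print(merge3_same_length(base, ours, theirs))
```

Without the ancestor, a tool looking only at ours and theirs could not tell which side's value for x or z is the newer one; the base is what turns two differences into two non-overlapping changes.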
Structured diffs
Line-level diffs are meaningless for some formats. A reordered JSON object, a whitespace-only YAML change, or a reformatted SQL query can produce a huge textual diff with zero semantic difference. Structured diff tools parse both documents into their native data model and compare the models directly. jq and custom jsondiff libraries handle JSON. SQL parsers produce AST diffs. The tradeoff is complexity: textual diff works on anything, structural diff has to understand the format.
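For JSON, a minimal structural comparison might look like the sketch below. The json_diff helper and its tuple layout are assumptions made for this example, not the API of any particular library, and arrays are compared as whole values to keep the code short.

```python
import json


def json_diff(old, new, path=""):
    """Return a list of (kind, path, old_value, new_value) differences.

    Objects are compared key by key, so key order never matters.
    Anything else (including arrays) is compared as a whole value.
    """
    changes = []
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(old.keys() | new.keys()):
            child = f"{path}/{key}"
            if key not in new:
                changes.append(("removed", child, old[key], None))
            elif key not in old:
                changes.append(("added", child, None, new[key]))
            else:
                changes.extend(json_diff(old[key], new[key], child))
    elif old != new:
        changes.append(("changed", path or "/", old, new))
    return changes


a = json.loads('{"server": {"host": "localhost", "port": 8080}}')
b = json.loads('{"server": {"port": 8443, "host": "localhost", "tls": true}}')
for change in json_diff(a, b):
    print(change)
```

The reordered host and port keys produce no output at all; only the changed port and the added tls key are reported.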
When to use git diff
git diff is the right tool for any file tracked by Git. It handles large files efficiently, shows colorized output, supports word-level output with --word-diff, can compare arbitrary commits or branches, and produces patch files with git diff > my.patch. For files outside a Git repo, or for text that exists only on your clipboard, a standalone diff tool is faster. For comparing two configs from different environments, or two API response payloads, a browser-based diff tool beats setting up a temporary repo.
Patience diff and histogram diff
Myers diff is fast but sometimes produces hunks that align changes poorly. A typical case is matching on a repeated line such as a blank line or closing brace, so that a newly inserted function appears to be spliced across the end of one existing function and the start of the next. Patience diff (Bram Cohen, 2008) and histogram diff (Git’s variant) use common unique lines as anchors to produce more human-readable diffs, sometimes at some cost in speed. Git offers --patience and --histogram flags that switch to these algorithms when the default Myers output is noisy.
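The anchor-selection step can be sketched on its own. The code below is an illustration of the idea, not Git's implementation: it finds lines that occur exactly once in both files, then keeps the longest run of them that appears in the same order on both sides. A full patience diff would then recurse into the gaps between anchors; the name patience_anchors is hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def patience_anchors(a, b):
    """Return (i, j) index pairs of unique common lines, in order in both files."""
    count_a, count_b = Counter(a), Counter(b)
    index_in_b = {line: j for j, line in enumerate(b) if count_b[line] == 1}
    # Candidate pairs, already sorted by position in a.
    pairs = [
        (i, index_in_b[line])
        for i, line in enumerate(a)
        if count_a[line] == 1 and line in index_in_b
    ]
    # Longest increasing subsequence over the b positions (patience sorting).
    tail_js = []    # smallest final j of any increasing run of each length
    tail_idx = []   # index into pairs of that run's last element
    prev = [None] * len(pairs)
    for idx, (_, j) in enumerate(pairs):
        k = bisect_left(tail_js, j)
        prev[idx] = tail_idx[k - 1] if k > 0 else None
        if k == len(tail_js):
            tail_js.append(j)
            tail_idx.append(idx)
        else:
            tail_js[k] = j
            tail_idx[k] = idx
    # Walk the back-pointers from the end of the longest run.
    anchors = []
    idx = tail_idx[-1] if tail_idx else None
    while idx is not None:
        anchors.append(pairs[idx])
        idx = prev[idx]
    anchors.reverse()
    return anchors


old = ["A", "}", "B", "}", "C"]
new = ["C", "A", "}", "B", "}"]
print(patience_anchors(old, new))  # [(0, 1), (2, 3)]
```

The repeated "}" lines are never used as anchors, which is what keeps the alignment from latching onto braces and blank lines.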
Common mistakes
Diffing binary files. Textual diff tools produce gibberish on binary input. Use a binary diff tool like bsdiff or compare hash digests for equality checks. Git detects binary files and shows Binary files differ rather than attempting a text diff.
Ignoring line-ending differences. A file moved between Windows and Unix often diffs completely because every line gained or lost a \r. Configure your diff tool or Git to normalize line endings before comparing.
Trusting whitespace-only diffs in code review. A pull request that touches 500 lines but is entirely whitespace changes can hide a single real change. Use the --ignore-all-space flag to see the semantic changes only.
Relying on line-level diff for JSON or YAML. Reordered keys look like huge diffs but change nothing. Use a structured diff tool for configuration data.
Applying a patch to the wrong base. A unified diff contains line numbers and context. If the target file has drifted from the patch’s expected base, the patch will fail or apply to the wrong place. Always verify the patch applies cleanly before relying on the result.
Diffing minified files. One-line minified bundles produce useless line-level diffs. Beautify both sides first, then diff.
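Two of the mistakes above, line-ending noise and whitespace-only changes, come down to normalizing input before comparing. The sketch below shows one way to do that in Python; normalize and differs are hypothetical helpers, and stripping all whitespace is only a rough stand-in for what an ignore-all-space option does.

```python
import re


def normalize(text, ignore_all_space=False):
    """Split text into lines with line endings (and optionally whitespace) removed."""
    # Treat \r\n and bare \r the same as \n.
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    if ignore_all_space:
        lines = [re.sub(r"\s+", "", line) for line in lines]
    return lines


def differs(old_text, new_text, ignore_all_space=False):
    """Report whether two texts differ after normalization."""
    return normalize(old_text, ignore_all_space) != normalize(new_text, ignore_all_space)


windows = "a = 1\r\nb = 2\r\n"
unix = "a = 1\nb = 2\n"
print(differs(windows, unix))                                # False
print(differs("a = 1\n", "a  =  1\n"))                       # True
print(differs("a = 1\n", "a  =  1\n", ignore_all_space=True))  # False
```

Whitespace is significant in some formats (Python and YAML indentation, for example), so treat a whitespace-insensitive comparison as a review aid rather than proof that nothing changed.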
Run the numbers
Compare two documents side-by-side or unified in the browser with the diff checker. Pair with the JSON diff checker when both sides are JSON and you need a structural comparison that ignores reordered keys, and the JSON formatter to normalize both inputs before diffing so whitespace and key ordering do not pollute the result.