How to Remove Duplicate Lines
Exact vs case-insensitive dedup, preserve-first vs preserve-last, trimmed vs raw comparison, and streaming large files.
Dedup looks like the simplest text operation in the world: remove lines that appear more than once. In reality “duplicate” is a spectrum. Is leading whitespace significant? Does case matter? Should the first occurrence win, or the last? Do trailing spaces make two lines different or the same? And what about a file that’s 10 GB and won’t fit in memory? The right answer depends entirely on what you’re cleaning — email lists, log files, source code, shopping lists — and picking the wrong one can silently discard data you needed. This guide walks through every dedup decision and the patterns that handle each.
Exact vs normalized dedup
Exact dedup compares bytes. Normalized dedup compares after a transformation — lowercase, trim, collapse whitespace, etc. Real-world lists almost always need some normalization, because real-world sources have inconsistent formatting.
inputs:
  "user@example.com"
  "User@Example.com"
  " user@example.com "
  "user@example.com\r"
exact dedup:      4 lines (all different)
normalized dedup: 1 line
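As a sketch of that difference (`normalize` and `dedupNormalized` are illustrative helpers, not library calls), here is one key function that treats all four inputs above as the same line:

```javascript
// Build a comparison key: strip a trailing \r, trim outer
// whitespace, lowercase. Originals are kept for output.
function normalize(line) {
  return line.replace(/\r$/, "").trim().toLowerCase();
}

function dedupNormalized(lines) {
  const seen = new Set();
  const result = [];
  for (const line of lines) {
    const key = normalize(line);
    if (!seen.has(key)) {
      seen.add(key);
      result.push(line); // first original spelling wins
    }
  }
  return result;
}
```

With exact comparison the four inputs above stay four lines; with this key they collapse to one.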
Case-insensitive dedup
Common for emails, usernames, domains. Build a key by lowercasing, keep the original for output:
const seen = new Set();
const result = [];
for (const line of lines) {
  const key = line.toLowerCase();
  if (!seen.has(key)) {
    seen.add(key);
    result.push(line); // preserve original case
  }
}
Trimmed comparison
Leading and trailing whitespace silently differentiates identical content. Trim for the comparison, keep whichever version you prefer for output:
const key = line.trim();
For really aggressive matching, also collapse internal whitespace:
const key = line.replace(/\s+/g, " ").trim();
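Plugged into the same Set-based loop as the case-insensitive example, the aggressive key gives a compact filter (a sketch; `dedupLoose` is an illustrative name, not from any library):

```javascript
// Dedup with whitespace-insensitive keys; original lines preserved.
function dedupLoose(lines) {
  const seen = new Set();
  return lines.filter(line => {
    const key = line.replace(/\s+/g, " ").trim();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```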
Preserve-first vs preserve-last
When two lines match, which copy do you keep? Default is preserve-first: walk the list, skip anything you’ve seen. Preserve-last requires a second pass:
// preserve-last: keep the LATER occurrence
const keyOf = line => line; // swap in your comparison key (trim, lowercase, etc.)
const map = new Map();
lines.forEach((line, i) => map.set(keyOf(line), { line, i }));
const result = [...map.values()]
  .sort((a, b) => a.i - b.i)
  .map(x => x.line);
Preserve-first is right for logs (earliest record matters). Preserve-last is right for change feeds (last state wins).
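An equivalent preserve-last trick, as a sketch (`dedupKeepLast` is an illustrative name): reverse the array, run an ordinary preserve-first pass, and reverse back.

```javascript
// Preserve-last by reusing a preserve-first pass on the
// reversed array, then restoring the original direction.
function dedupKeepLast(lines, keyOf = l => l) {
  const seen = new Set();
  const kept = [];
  for (const line of [...lines].reverse()) {
    const key = keyOf(line);
    if (!seen.has(key)) {
      seen.add(key);
      kept.push(line);
    }
  }
  return kept.reverse();
}
```

Each surviving line sits at the position of its last occurrence, so `["a", "b", "a"]` becomes `["b", "a"]`.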
Unique vs all-duplicates
Three possible outputs for a deduplication job:
- Unique — each distinct line once
- First-occurrence only — preserves order
- Only duplicates — lines that appeared more than once (opposite direction)
// lines that appeared > 1 time
const counts = new Map();
for (const line of lines) {
  counts.set(line, (counts.get(line) || 0) + 1);
}
const dupes = [...counts.entries()]
  .filter(([_, n]) => n > 1)
  .map(([line]) => line);
Unix: sort | uniq
The classic one-liner. But note: uniq only dedups consecutive duplicates, which is why you sort first.
sort input.txt | uniq > output.txt
sort -f input.txt | uniq -i > output.txt  # case-insensitive
sort input.txt | uniq -c > counts.txt     # with counts
sort input.txt | uniq -d > dupes.txt      # only duplicates
Preserving order with awk
Sorting destroys order. awk dedups while preserving the original sequence:
awk '!seen[$0]++' input.txt > output.txt
The trick: seen[$0]++ evaluates to 0 on the first occurrence (falsy, so ! makes it true and the line prints) and ≥ 1 thereafter (truthy, so ! makes it false and the line is skipped).
Large files: streaming dedup
An in-memory Set needs space proportional to the number of distinct lines. For files bigger than RAM, you have two options:
- External sort + uniq — disk-based, works for any size, O(N log N) time
- Bloom filter — constant space, probabilistic: it never lets a true duplicate through, but false positives occasionally drop unique lines
# GNU sort spills to disk automatically
sort -u --parallel=4 -S 2G -T /tmp huge.txt > dedup.txt
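The second option can be sketched in a few lines. This BloomFilter class is illustrative (hand-rolled FNV-1a probes, not a library API): while streaming, skip any line for which mightContain() returns true, then add() it — accepting that a false positive occasionally drops a unique line.

```javascript
// Minimal Bloom filter: k hash probes into an m-bit array.
// add() never forgets, so true duplicates are always caught;
// mightContain() can return false positives, so a unique line
// is occasionally mistaken for a duplicate and dropped.
class BloomFilter {
  constructor(mBits = 8 * 1024 * 1024, k = 4) {
    this.m = mBits;
    this.k = k;
    this.bits = new Uint8Array(Math.ceil(mBits / 8));
  }
  // FNV-1a with a per-probe seed, reduced mod m.
  hash(str, seed) {
    let h = (2166136261 ^ seed) >>> 0;
    for (let i = 0; i < str.length; i++) {
      h ^= str.charCodeAt(i);
      h = Math.imul(h, 16777619) >>> 0;
    }
    return h % this.m;
  }
  add(str) {
    for (let s = 0; s < this.k; s++) {
      const b = this.hash(str, s);
      this.bits[b >> 3] |= 1 << (b & 7);
    }
  }
  mightContain(str) {
    for (let s = 0; s < this.k; s++) {
      const b = this.hash(str, s);
      if (!(this.bits[b >> 3] & (1 << (b & 7)))) return false;
    }
    return true;
  }
}
```

Size the bit array to the expected number of distinct lines; the false-positive rate grows as the filter fills.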
Hash-based keys for long lines
If lines are very long (>1 KB each) and you have millions of them, storing full lines in a Set wastes memory. Store a hash instead:
import crypto from "crypto";
const seen = new Set();
const out = [];
for (const line of lines) {
  const h = crypto.createHash("sha1").update(line).digest("hex");
  if (!seen.has(h)) { seen.add(h); out.push(line); }
}
SHA-1 collisions on human text are vanishingly rare. For adversarial input, use SHA-256.
Dedup with count column
Sometimes you want the deduplicated list with how many times each appeared. Useful for frequency analysis:
sort input.txt | uniq -c | sort -rn | head -20
The -c flag prefixes each line with its count; sort -rn puts the highest counts first.
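The same pipeline translates to a few lines of JavaScript (`topLines` is an illustrative helper, a sketch rather than a drop-in tool):

```javascript
// Frequency table: count every line, sort descending by count,
// keep the top n -- mirroring `sort | uniq -c | sort -rn | head`.
function topLines(lines, n = 20) {
  const counts = new Map();
  for (const line of lines) {
    counts.set(line, (counts.get(line) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, n); // [line, count] pairs
}
```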
CSV dedup by key column
For tabular data, “duplicate” usually means “same value in the key column,” not full-row match. Use a CSV-aware tool:
# csvkit
csvsort -c email input.csv | uniq -f2  # approximate: uniq -f skips whitespace fields, not commas
# or better: load into a script and dedup by column
import csv

seen = set()
with open("in.csv", newline="") as f, open("out.csv", "w", newline="") as g:
    r = csv.DictReader(f)
    w = csv.DictWriter(g, r.fieldnames)
    w.writeheader()
    for row in r:
        key = row["email"].lower()
        if key not in seen:
            seen.add(key)
            w.writerow(row)
Common mistakes
- Assuming uniq dedups without sorting first.
- Comparing raw lines without trimming, then finding that 80% of the surviving "duplicates" are just whitespace variants.
- Losing order when order mattered.
- Keeping the wrong copy (first vs last) for the problem.
- Running in-memory dedup on a 20 GB file and crashing.
- Dedup'ing on full rows when only one column mattered.