How to Remove Duplicate Lines
Dedup text lines using exact, trimmed, or case-insensitive comparison free online. Process large files and choose preserve-first or preserve-last instantly with no signup.
Dedup looks like the simplest text operation in the world: remove lines that appear more than once. In reality “duplicate” is a spectrum. Is leading whitespace significant? Does case matter? Should the first occurrence win, or the last? Do trailing spaces make two lines different or the same? And what about a file that’s 10 GB and won’t fit in memory? The right answer depends entirely on what you’re cleaning — email lists, log files, source code, shopping lists — and picking the wrong one can silently discard data you needed. This guide walks through every dedup decision and the patterns that handle each.
Exact vs normalized dedup
Exact dedup compares bytes. Normalized dedup compares after a transformation — lowercase, trim, collapse whitespace, etc. Real-world lists almost always need some normalization, because real-world sources have inconsistent formatting.
inputs:
  "user@example.com"
  "User@Example.com"
  " user@example.com "
  "user@example.com\r"
exact dedup: 4 lines (all different)
normalized dedup: 1 line
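The comparison above can be reproduced with a small key function. This is a minimal sketch: `normalizeKey` and `countDistinct` are illustrative names rather than part of any library, and trimming plus lowercasing is an assumption that suits email-like data.

```javascript
// Hypothetical helper: builds the comparison key for a line.
// Assumption: trim + lowercase is the right normalization for this data.
function normalizeKey(line) {
  return line.trim().toLowerCase(); // trim() also strips a trailing "\r"
}

// Counts how many distinct lines exist under a given key function.
function countDistinct(lines, keyOf) {
  return new Set(lines.map(keyOf)).size;
}

const inputs = [
  "user@example.com",
  "User@Example.com",
  " user@example.com ",
  "user@example.com\r",
];

const exactCount = countDistinct(inputs, (line) => line);
const normalizedCount = countDistinct(inputs, normalizeKey);
console.log(exactCount, normalizedCount); // 4 1
```

Keeping the key function separate from the dedup loop makes it easy to swap in a stricter or looser definition of "same line" later.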
Case-insensitive dedup
Common for emails, usernames, domains. Build a key by lowercasing, keep the original for output:
const seen = new Set();
const result = [];
for (const line of lines) {
  const key = line.toLowerCase();
  if (!seen.has(key)) {
    seen.add(key);
    result.push(line); // preserve original case
  }
}
Trimmed comparison
Leading and trailing whitespace silently differentiates identical content. Trim for the comparison, keep whichever version you prefer for output:
const key = line.trim();
For really aggressive matching, also collapse internal whitespace:
const key = line.replace(/\s+/g, " ").trim();
Preserve-first vs preserve-last
When two lines match, which copy do you keep? Default is preserve-first: walk the list, skip anything you’ve seen. Preserve-last requires a second pass:
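// preserve-last: keep the LATER occurrence
const keyOf = (line) => line.trim().toLowerCase(); // example key; use whatever normalization fits
const map = new Map();
// a later line with the same key overwrites the earlier entry
lines.forEach((line, i) => map.set(keyOf(line), { line, i }));
// Map keeps a key's first insertion slot, so re-sort by the kept line's position
const result = [...map.values()]
  .sort((a, b) => a.i - b.i)
  .map(x => x.line);
Preserve-first is right for logs (earliest record matters). Preserve-last is right for change feeds (last state wins).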
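Both policies can share one function with a switch. The sketch below is illustrative only: `dedupe`, its `keep` option, and the `name=value` sample data are assumptions made for the example, not an established API.

```javascript
// Hypothetical helper: dedups `lines` by `keyOf`, keeping the first or last copy.
// Output order follows the position of whichever copy was kept.
function dedupe(lines, { keyOf = (line) => line, keep = "first" } = {}) {
  const kept = new Map(); // key -> { line, index } of the copy to keep
  lines.forEach((line, index) => {
    const key = keyOf(line);
    if (keep === "last" || !kept.has(key)) {
      kept.set(key, { line, index });
    }
  });
  return [...kept.values()]
    .sort((a, b) => a.index - b.index)
    .map((entry) => entry.line);
}

const feed = ["a=1", "b=1", "a=2", "c=1", "b=2"];
const byName = (line) => line.split("=")[0]; // compare on the part before "="

const firstWins = dedupe(feed, { keyOf: byName, keep: "first" });
const lastWins = dedupe(feed, { keyOf: byName, keep: "last" });
console.log(firstWins); // [ 'a=1', 'b=1', 'c=1' ]
console.log(lastWins);  // [ 'a=2', 'c=1', 'b=2' ]
```

Note that preserve-last changes the output order as well as the surviving values: each key appears where its last copy was, not where its first copy was.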
Unique vs all-duplicates
Three possible outputs for a deduplication job:
- Unique — each distinct line once
- First-occurrence only — preserves order
- Only duplicates — lines that appeared more than once (opposite direction)
// lines that appeared > 1 time
const counts = new Map();
for (const line of lines) {
  counts.set(line, (counts.get(line) || 0) + 1);
}
const dupes = [...counts.entries()]
  .filter(([_, n]) => n > 1)
  .map(([line]) => line);
Unix: sort | uniq
The classic one-liner. But note: uniq only dedups consecutive duplicates, which is why you sort first.
sort input.txt | uniq > output.txt
sort -f input.txt | uniq -i > output.txt   # case-insensitive
sort input.txt | uniq -c > counts.txt      # with counts
sort input.txt | uniq -d > dupes.txt       # only duplicates
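To see why the sort matters, here is a small sketch that imitates the adjacent-only behavior of uniq in JavaScript. `uniqAdjacent` is a hypothetical helper written for this illustration, not a port of the real tool.

```javascript
// Hypothetical helper that mimics uniq: drops a line only when it equals
// the line immediately before it.
function uniqAdjacent(lines) {
  return lines.filter((line, i) => i === 0 || line !== lines[i - 1]);
}

const sample = ["b", "a", "b", "b", "a"];

const withoutSort = uniqAdjacent(sample);
const withSort = uniqAdjacent([...sample].sort());
console.log(withoutSort); // [ 'b', 'a', 'b', 'a' ]
console.log(withSort);    // [ 'a', 'b' ]
```

Without the sort, only the back-to-back "b" pair collapses; the other repeats survive because they are not neighbors.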
Preserving order with awk
Sorting destroys order. awk dedups while preserving the original sequence:
awk '!seen[$0]++' input.txt > output.txt
The trick: seen[$0]++ is 0 on first occurrence (falsy, so ! = true, print), and ≥ 1 thereafter (truthy, so ! = false, skip).
Large files: streaming dedup
In-memory Set is O(N) space. For files bigger than RAM, you have two options:
- External sort + uniq: disk-based, works for any size, O(N log N) time
- Bloom filter: fixed memory chosen up front, probabilistic. It never lets a true duplicate through, but it can occasionally mistake a unique line for a duplicate and drop it
# GNU sort spills to disk automatically
sort -u --parallel=4 -S 2G -T /tmp huge.txt > dedup.txt
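For the Bloom filter option, the following is a minimal sketch rather than a production implementation. The class name, the FNV-1a hash with double hashing, and the sizes (1 Mi bits, 4 hash positions) are all assumptions chosen for illustration; real sizing depends on the expected line count and the error rate you can tolerate.

```javascript
// Minimal Bloom filter sketch. Sizes and hash choices are illustrative assumptions.
class BloomFilter {
  constructor(bitCount, hashCount) {
    this.bitCount = bitCount;
    this.hashCount = hashCount;
    this.bits = new Uint8Array(Math.ceil(bitCount / 8)); // memory is fixed here
  }

  // 32-bit FNV-1a over UTF-16 code units, with a seed mixed into the start value.
  static fnv1a(text, seed) {
    let hash = (0x811c9dc5 ^ seed) >>> 0;
    for (let i = 0; i < text.length; i++) {
      hash ^= text.charCodeAt(i);
      hash = Math.imul(hash, 0x01000193) >>> 0;
    }
    return hash;
  }

  // Derives hashCount bit positions from two base hashes (double hashing).
  positions(text) {
    const h1 = BloomFilter.fnv1a(text, 0);
    const h2 = (BloomFilter.fnv1a(text, 0x9e3779b9) | 1) >>> 0; // force an odd step
    const result = [];
    for (let i = 0; i < this.hashCount; i++) {
      result.push((h1 + i * h2) % this.bitCount);
    }
    return result;
  }

  // Returns true if the text was possibly seen before; records it either way.
  testAndAdd(text) {
    let seenBefore = true;
    for (const pos of this.positions(text)) {
      const byteIndex = pos >> 3;
      const mask = 1 << (pos & 7);
      if ((this.bits[byteIndex] & mask) === 0) {
        seenBefore = false; // at least one bit was unset, so this text is new
        this.bits[byteIndex] |= mask;
      }
    }
    return seenBefore;
  }
}

// Walks any iterable of lines and yields those the filter has not (probably) seen.
function* bloomDedupe(lines, filter) {
  for (const line of lines) {
    if (!filter.testAndAdd(line)) yield line;
  }
}

const bloom = new BloomFilter(1 << 20, 4); // 1 Mi bits = 128 KiB
const bloomResult = [...bloomDedupe(["x", "y", "x", "z", "y"], bloom)];
console.log(bloomResult);
```

A repeated line always finds all of its bits already set, so it is never emitted twice. The risk runs the other way: a new line whose bits were all set by other lines is silently dropped, and that becomes more likely as the filter fills up.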
Hash-based keys for long lines
If lines are very long (>1 KB each) and you have millions of them, storing full lines in a Set wastes memory. Store a hash instead:
import crypto from "crypto";
const seen = new Set();
const out = [];
for (const line of lines) {
  // the 40-character hex digest stands in for the full line
  const h = crypto.createHash("sha1").update(line).digest("hex");
  if (!seen.has(h)) { seen.add(h); out.push(line); }
}
SHA-1 collisions on human text are vanishingly rare. For adversarial input, use SHA-256.
Dedup with count column
Sometimes you want the deduplicated list with how many times each appeared. Useful for frequency analysis:
sort input.txt | uniq -c | sort -rn | head -20
The -c flag prefixes each line with its count; sort -rn puts the highest counts first, and head -20 keeps the top 20.
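The same frequency table can be built in a script. This is a rough JavaScript counterpart to the pipeline above, with `topCounts` and the sample request lines invented for the example.

```javascript
// Hypothetical helper: counts each distinct line and returns the most frequent first.
function topCounts(lines, limit = 20) {
  const counts = new Map();
  for (const line of lines) {
    counts.set(line, (counts.get(line) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1]) // highest count first; ties keep first-seen order
    .slice(0, limit);
}

const requests = ["GET /", "GET /a", "GET /", "GET /b", "GET /a", "GET /"];
const top = topCounts(requests, 2);
console.log(top); // [ [ 'GET /', 3 ], [ 'GET /a', 2 ] ]
```

Unlike the shell pipeline, this version does not sort the input first, so lines with equal counts stay in the order they first appeared.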
CSV dedup by key column
For tabular data, “duplicate” usually means “same value in the key column,” not full-row match. Use a CSV-aware tool:
# csvkit
csvsort -c email input.csv | uniq -f2 # approximate: uniq -f skips whitespace-separated fields, not CSV columns
# or better: load into a script and dedup by column
import csv
seen = set()
# newline="" lets the csv module handle line endings itself
with open("in.csv", newline="") as f, open("out.csv", "w", newline="") as g:
    r = csv.DictReader(f)
    w = csv.DictWriter(g, r.fieldnames)
    w.writeheader()
    for row in r:
        key = row["email"].lower()  # case-insensitive key column
        if key not in seen:
            seen.add(key)
            w.writerow(row)
Common mistakes
Assuming uniq dedups without sorting first. Comparing raw lines without trimming, so that many of the surviving “unique” lines are really whitespace variants of each other. Losing order when order mattered. Keeping the wrong copy (first vs last) for the problem at hand. Running in-memory dedup on a 20 GB file and crashing. And deduping on full rows when only one column mattered.