Free Tool Arena


How to Remove Duplicate Lines

Exact vs case-insensitive dedup, preserve-first vs preserve-last, trimmed vs raw comparison, and streaming large files.

Updated April 2026 · 6 min read

Dedup looks like the simplest text operation in the world: remove lines that appear more than once. In reality “duplicate” is a spectrum. Is leading whitespace significant? Does case matter? Should the first occurrence win, or the last? Do trailing spaces make two lines different or the same? And what about a file that’s 10 GB and won’t fit in memory? The right answer depends entirely on what you’re cleaning — email lists, log files, source code, shopping lists — and picking the wrong one can silently discard data you needed. This guide walks through every dedup decision and the patterns that handle each.


Exact vs normalized dedup

Exact dedup compares bytes. Normalized dedup compares after a transformation — lowercase, trim, collapse whitespace, etc. Real-world lists almost always need some normalization, because real-world sources have inconsistent formatting.

inputs:
  "user@example.com"
  "User@Example.com"
  "  user@example.com  "
  "user@example.com\r"

exact dedup:       4 lines (all different)
normalized dedup:  1 line
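A sketch of that normalized dedup in JavaScript — the normalize helper here is illustrative (trim, strip a trailing carriage return, lowercase); adjust its rules to your data. The key is only used for comparison; the first original line is what gets emitted:

```javascript
// Illustrative normalizer: drop trailing \r, trim whitespace, lowercase.
function normalize(line) {
  return line.replace(/\r$/, "").trim().toLowerCase();
}

function dedupNormalized(lines) {
  const seen = new Set();
  const out = [];
  for (const line of lines) {
    const key = normalize(line);
    if (!seen.has(key)) {
      seen.add(key);
      out.push(line); // emit the first original form, not the key
    }
  }
  return out;
}

const inputs = [
  "user@example.com",
  "User@Example.com",
  "  user@example.com  ",
  "user@example.com\r",
];
console.log(dedupNormalized(inputs)); // [ 'user@example.com' ]
```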

Case-insensitive dedup

Common for emails, usernames, domains. Build a key by lowercasing, keep the original for output:

const seen = new Set();
const result = [];
for (const line of lines) {
  const key = line.toLowerCase();
  if (!seen.has(key)) {
    seen.add(key);
    result.push(line);   // preserve original case
  }
}

Trimmed comparison

Leading and trailing whitespace silently differentiates identical content. Trim for the comparison, keep whichever version you prefer for output:

const key = line.trim();

For really aggressive matching, also collapse internal whitespace:

const key = line.replace(/\s+/g, " ").trim();

Preserve-first vs preserve-last

When two lines match, which copy do you keep? Default is preserve-first: walk the list, skip anything you’ve seen. Preserve-last requires a second pass:

// preserve-last: keep the LATER occurrence
const map = new Map();
lines.forEach((line, i) => map.set(keyOf(line), { line, i }));
const result = [...map.values()]
  .sort((a, b) => a.i - b.i)
  .map(x => x.line);

Preserve-first is right for logs (earliest record matters). Preserve-last is right for change feeds (last state wins).
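A minimal side-by-side sketch of the two policies (keyOf is the identity function here; swap in any normalizer from the sections above):

```javascript
const keyOf = (line) => line; // identity key; swap in trim/lowercase as needed

// preserve-first: one pass, skip anything already seen
function preserveFirst(lines) {
  const seen = new Set();
  return lines.filter((line) => {
    const k = keyOf(line);
    if (seen.has(k)) return false;
    seen.add(k);
    return true;
  });
}

// preserve-last: later occurrences overwrite earlier ones in the Map,
// then we restore ordering by each survivor's (last) position
function preserveLast(lines) {
  const map = new Map();
  lines.forEach((line, i) => map.set(keyOf(line), { line, i }));
  return [...map.values()].sort((a, b) => a.i - b.i).map((x) => x.line);
}

const feed = ["alice", "bob", "alice", "carol", "bob"];
console.log(preserveFirst(feed)); // [ 'alice', 'bob', 'carol' ]
console.log(preserveLast(feed));  // [ 'alice', 'carol', 'bob' ]
```

Note the two results differ in order as well as in which copy survives: preserve-last orders survivors by their final appearance.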

Unique vs all-duplicates

Three possible outputs for a deduplication job:

  • Unique — each distinct line exactly once (order may change, e.g. sorted output)
  • First-occurrence only — each distinct line once, in its original order
  • Only duplicates — only the lines that appeared more than once (the inverse job)
// lines that appeared > 1 time
const counts = new Map();
for (const line of lines) {
  counts.set(line, (counts.get(line) || 0) + 1);
}
const dupes = [...counts.entries()]
  .filter(([_, n]) => n > 1)
  .map(([line]) => line);

Unix: sort | uniq

The classic one-liner. But note: uniq only dedups consecutive duplicates, which is why you sort first.

sort input.txt | uniq > output.txt

sort -f input.txt | uniq -i > output.txt   # case-insensitive
sort input.txt | uniq -c > counts.txt      # with counts
sort input.txt | uniq -d > dupes.txt       # only duplicates

Preserving order with awk

Sorting destroys order. awk dedups while preserving the original sequence:

awk '!seen[$0]++' input.txt > output.txt

The trick: seen[$0]++ evaluates to 0 on the first occurrence (falsy, so ! = true, print) and to ≥ 1 thereafter (truthy, so ! = false, skip).

Large files: streaming dedup

In-memory Set is O(N) space. For files bigger than RAM, you have two options:

  • External sort + uniq — disk-based, works for any size, O(N log N) time
  • Bloom filter — constant space, probabilistic: it can falsely flag a unique line as a duplicate (and drop it), but it never misses a real duplicate
# GNU sort spills to disk automatically
sort -u --parallel=4 -S 2G -T /tmp huge.txt > dedup.txt
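If the unique lines fit in memory but the whole file doesn't, a streaming pass also works. A Node sketch using readline (file names are placeholders):

```javascript
import { createReadStream, createWriteStream } from "fs";
import { createInterface } from "readline";

async function dedupStream(inPath, outPath) {
  const seen = new Set(); // grows with unique lines, not with file size
  const out = createWriteStream(outPath);
  const rl = createInterface({
    input: createReadStream(inPath),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  for await (const line of rl) {
    if (!seen.has(line)) {
      seen.add(line);
      out.write(line + "\n");
    }
  }
  // wait until the output stream has flushed
  await new Promise((resolve) => out.end(resolve));
}

// dedupStream("huge.txt", "dedup.txt");
```

Memory here is bounded by the number of unique lines, not the file size; when even that is too much, fall back to external sort or to hashed keys.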

Hash-based keys for long lines

If lines are very long (>1 KB each) and you have millions of them, storing full lines in a Set wastes memory. Store a hash instead:

import crypto from "crypto";
const seen = new Set();
const out = [];
for (const line of lines) {
  const h = crypto.createHash("sha1").update(line).digest("hex");
  if (!seen.has(h)) { seen.add(h); out.push(line); }
}

SHA-1 collisions on human text are vanishingly rare. For adversarial input, use SHA-256.

Dedup with count column

Sometimes you want the deduplicated list with how many times each appeared. Useful for frequency analysis:

sort input.txt | uniq -c | sort -rn | head -20

The -c flag prefixes counts; sort -rn puts the highest counts first; head -20 keeps the top 20.

CSV dedup by key column

For tabular data, “duplicate” usually means “same value in the key column,” not full-row match. Use a CSV-aware tool:

# csvkit
csvsort -c email input.csv | uniq -f2   # approximate — uniq counts whitespace-separated fields, not commas

# or better: load into a script and dedup by column
import csv
seen = set()
with open("in.csv", newline="") as f, open("out.csv", "w", newline="") as g:
    r = csv.DictReader(f)
    w = csv.DictWriter(g, r.fieldnames)
    w.writeheader()
    for row in r:
        if row["email"].lower() not in seen:
            seen.add(row["email"].lower())
            w.writerow(row)

Common mistakes

  • Assuming uniq dedups without sorting first.
  • Comparing raw lines without trimming and getting 80% “duplicate” survivors that are actually just whitespace variants.
  • Losing order when order mattered.
  • Discarding the wrong copy (first vs last) for the problem.
  • Running in-memory dedup on a 20 GB file and crashing.
  • Dedup’ing on full rows when only one column mattered.

Run the numbers

Remove duplicate lines · Text sorter · Line counter

