Free Tool Arena


How to Normalize Unicode Text

NFC/NFD/NFKC/NFKD forms, composed vs decomposed, homoglyph attacks, search indexing, and database key normalization.

Updated April 2026 · 6 min read

Unicode lets the same visible text be encoded multiple ways. The letter “é” can be one code point (U+00E9) or two (U+0065 + U+0301), and both render identically. When you compare two strings, index them in a database, use them as cache keys, or run a regex across them, these equivalent-but-different encodings silently diverge. Unicode normalization forces a canonical form so two “equal” strings actually compare equal. This guide covers the four normalization forms (NFC, NFD, NFKC, NFKD), when to use each, the security implications of homoglyph attacks, and the database and search-index patterns that depend on consistent normalization.


Why normalization exists

Unicode encodes accented characters in two canonically equivalent ways: precomposed (a single code point, carried over from legacy character sets) and decomposed (a base letter plus combining marks). Both render identically. Neither is “wrong.” But comparing them requires normalization.

"caf\u00e9"         // 4 code points (precomposed)
"cafe\u0301"        // 5 code points (decomposed)

length:              4          5
===:                 false
after normalize:     both become "caf\u00e9"
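The same comparison, runnable with JavaScript's built-in String.prototype.normalize:

```javascript
const precomposed = "caf\u00e9";  // é as a single code point
const decomposed = "cafe\u0301"; // e followed by U+0301 combining acute

console.log(precomposed.length);         // 4
console.log(decomposed.length);          // 5
console.log(precomposed === decomposed); // false
console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC")); // true
```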

The four forms

  • NFC — Canonical Composition. Combines decomposed characters into precomposed form. Usually what you want.
  • NFD — Canonical Decomposition. Splits precomposed characters into base + combining marks. Useful for stripping accents.
  • NFKC — Compatibility Composition. NFC plus compatibility replacements (e.g., full-width to half-width, ligatures to individual letters).
  • NFKD — Compatibility Decomposition. NFD plus compatibility mapping.

NFC: the default for storage and comparison

NFC produces the shortest, most common form. Most of the web stores text in NFC. Compare with NFC for “are these the same user-perceived string” tests.

a.normalize("NFC") === b.normalize("NFC")

NFD: when you want to strip accents

Normalize to decomposed form, then strip combining marks (\p{M}). You get ASCII-ish letters without the diacritics.

"caf\u00e9".normalize("NFD").replace(/\p{M}/gu, "")
// -> "cafe"

This is the backbone of slug generation and accent-insensitive search.
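A minimal slug helper built on that pattern (the function name and the ASCII-only policy are illustrative choices; this drops non-Latin scripts entirely):

```javascript
function slugify(s) {
  return s
    .normalize("NFD")             // split letters from their accents
    .replace(/\p{M}/gu, "")       // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")  // collapse everything else to hyphens
    .replace(/^-+|-+$/g, "");     // trim leading/trailing hyphens
}

slugify("Crème Brûlée!"); // "creme-brulee"
```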

NFKC: lossy but useful

NFKC collapses visual variants to their “plain” form:

"\uFF21\uFF22\uFF23".normalize("NFKC")
// -> "ABC" (full-width to half-width)

"\uFB00".normalize("NFKC")
// -> "ff" (ligature to letters)

"\u00B2".normalize("NFKC")
// -> "2" (superscript to digit)

Great for search and deduplication. Not great for preserving authorial intent — a document’s typographic ligatures and superscripts are meaningful.
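For example, a dedupe pass keyed on the NFKC form (a sketch; the Map-based approach and the function name are illustrative):

```javascript
// Keep one representative per NFKC-equivalent group.
function dedupeCompat(strings) {
  const seen = new Map();
  for (const s of strings) {
    const key = s.normalize("NFKC");
    if (!seen.has(key)) seen.set(key, s); // first spelling wins
  }
  return [...seen.values()];
}

dedupeCompat(["ABC", "\uFF21\uFF22\uFF23", "ff", "\uFB00"]); // ["ABC", "ff"]
```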

NFKD: search-index form

NFKD is the aggressive “one true form” for search: strip compatibility variants and decompose. Then you can strip combining marks for full accent-insensitive indexing.

function searchKey(s) {
  return s
    .normalize("NFKD")
    .replace(/\p{M}/gu, "")
    .toLowerCase();
}

When normalizations disagree

NFC and NFD are canonical and round-trip safely: normalize an NFC string to NFD and back to NFC and you get the original. NFKC and NFKD are lossy. Once you’ve NFKC’d a string containing the ff ligature, you can’t recover the ligature from “ff.”
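A quick check of both properties:

```javascript
const s = "caf\u00e9"; // already in NFC
// Canonical round trip: NFD and back recovers the original.
console.log(s.normalize("NFD").normalize("NFC") === s); // true

// Compatibility mapping is one-way: the ligature is gone for good.
console.log("\uFB00".normalize("NFKC")); // "ff"
console.log("ff".normalize("NFKC"));     // "ff" (nothing maps back to U+FB00)
```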

Database key normalization

If your DB stores user handles, email addresses, or anything user-typed as a primary key, normalize before insert and before lookup. Pick NFC for display-preserving storage; NFKC if you want to treat full-width and half-width as equivalent.

INSERT INTO users (handle) VALUES (normalize(input, NFC));
SELECT * FROM users WHERE handle = normalize(lookup, NFC);

Postgres has normalize() built in. MySQL and SQLite require application-level normalization.
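With MySQL or SQLite, the same discipline lives in application code. A sketch with an in-memory Map standing in for the table (insertUser and findUser are illustrative names):

```javascript
const users = new Map(); // stand-in for a table keyed by handle

function insertUser(handle, row) {
  users.set(handle.normalize("NFC"), row); // normalize before insert
}

function findUser(handle) {
  return users.get(handle.normalize("NFC")); // and before lookup
}

insertUser("cafe\u0301", { id: 1 }); // decomposed on the way in
findUser("caf\u00e9");               // precomposed lookup still finds { id: 1 }
```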

Homoglyph attacks

Attackers exploit visually-similar characters from different scripts. Latin “a” (U+0061) and Cyrillic “а” (U+0430) look identical but are different code points. Normalization doesn’t collapse these — they’re distinct Unicode characters. To defend:

  • Restrict identifiers to a single script (Unicode IDN rules for domains)
  • Flag or block mixed-script strings
  • Use confusables.txt data from Unicode CLDR
  • For passwords and usernames, apply PRECIS profiles (RFC 8264)
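A minimal mixed-script check for the Latin/Cyrillic pair, using Unicode property escapes in a regex (a sketch; production code should consult the full confusables data from UTS #39):

```javascript
// Flags strings that mix Latin and Cyrillic letters, a commonly
// abused homoglyph pair. Extend with other script pairs as needed.
function mixesLatinCyrillic(s) {
  return /\p{Script=Latin}/u.test(s) && /\p{Script=Cyrillic}/u.test(s);
}

mixesLatinCyrillic("paypal");      // false (all Latin)
mixesLatinCyrillic("p\u0430ypal"); // true (Cyrillic а hiding in a Latin word)
```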

Normalization + case folding

Case-insensitive comparison needs case folding, not just toLowerCase. German ß uppercases to “SS”; Turkish dotless ı and dotted i don’t map the way English expects.

// JS has limited folding
str.normalize("NFC").toLowerCase();

// Intl.Collator handles locale correctly
new Intl.Collator("tr", { sensitivity: "accent" })
  .compare(a, b) === 0
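The Turkish mapping in action (this assumes a runtime with full ICU locale data, which modern Node.js ships by default):

```javascript
console.log("TITLE".toLowerCase());           // "title"
console.log("TITLE".toLocaleLowerCase("tr")); // "tıtle" (I lowercases to dotless ı)
```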

Benchmarking and file size

Normalization is cheap for short strings (microseconds). Large documents (books, corpora) can measure in milliseconds. For streaming pipelines, normalize in chunks that align to grapheme boundaries — don’t normalize half a combining sequence.
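One way to sketch that chunk-boundary rule: hold back the trailing base character plus any marks, since the next chunk may begin with more combining marks (simplified; Hangul jamo and other composition edge cases need the full normalization boundary rules):

```javascript
function* normalizeStream(chunks, form = "NFC") {
  let carry = "";
  for (const chunk of chunks) {
    const text = carry + chunk;
    // Find the last base character and its trailing marks; keep them back.
    const m = text.match(/\P{M}\p{M}*$/u);
    const cut = m ? m.index : 0;
    yield text.slice(0, cut).normalize(form);
    carry = text.slice(cut);
  }
  yield carry.normalize(form);
}

// A combining acute split across chunks still composes correctly:
[...normalizeStream(["cafe", "\u0301"])].join(""); // "café"
```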

Round-tripping through systems

Copy through Windows, macOS, Linux, and web apps, and normalization form can silently change. macOS famously uses NFD for its filesystem, which means file names copied to other systems shift form. Always normalize at boundaries: on input, on storage, on output.
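The macOS case in miniature:

```javascript
const fromMac = "re\u0301sume\u0301.txt"; // NFD, as macOS filenames often arrive
const stored  = "r\u00e9sum\u00e9.txt";   // NFC, as a typical database stores it

console.log(fromMac === stored);                  // false ("identical" names mismatch)
console.log(fromMac.normalize("NFC") === stored); // true (normalize at the boundary)
```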

Common mistakes

Comparing strings without normalizing, and wondering why “equal” strings don’t match. Using NFKC for archival storage and losing typographic ligatures. Assuming toLowerCase is enough for case-insensitive compare across locales. Thinking normalization defends against homoglyphs — it doesn’t. And forgetting that filenames from macOS are often NFD while your database stores NFC, causing case-like mismatches that look impossible.


